In [1]:
import pandas as pd
import numpy as np

In [31]:
df_fake = pd.read_csv('news.csv', encoding= 'unicode_escape')
df_fake.head(15)

Unnamed: 0,unit_id,article_title,article_content,source,date,location,labels
0,1914947530,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,nna,04-05-2017,idlib,0
1,1914947532,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,nna,04-07-2017,homs,0
2,1914947533,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,nna,4/16/2017,aleppo,0
3,1914947534,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,nna,4/19/2017,aleppo,0
4,1914947535,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,nna,07-10-2016,aleppo,0
5,1914947536,Suicide bombing kills at least 16 in northeast...,Tue 05 Jul 2016 Suicide bombing kills at least...,nna,07-05-2016,hasakeh,0
6,1914947537,22 dead in heavy U.S. raids on IS Syria strong...,Sun 05 Jul 2015 22 dead in heavy U.S. raids on...,nna,07-05-2015,raqqa,0
7,1914947538,Suicide bomber kills 4 in Assad clans hometown,Sun 22 Feb 2015 Suicide bomber kills 4 in Assa...,nna,2/22/2015,lattakia,0
8,1914947539,Explosion rocks down town Damascus,Sun 01 Feb 2015 Explosion rocks down town Dama...,nna,02-01-2015,damascus,1
9,1914947540,Damascus explosion due to rocket bomb,Sat 24 Aug 2013 Damascus explosion due to rock...,nna,8/24/2013,damascus,0


In [32]:
df_fake = df_fake.drop(['unit_id','source','date','location'],axis=1)
df_fake = df_fake.dropna()

In [29]:
df_fake.head(5)

Unnamed: 0,article_title,article_content,labels
0,Syria attack symptoms consistent with nerve ag...,Wed 05 Apr 2017 Syria attack symptoms consiste...,0
1,Homs governor says U.S. attack caused deaths b...,Fri 07 Apr 2017 at 0914 Homs governor says U.S...,0
2,Death toll from Aleppo bomb attack at least 112,Sun 16 Apr 2017 Death toll from Aleppo bomb at...,0
3,Aleppo bomb blast kills six Syrian state TV,Wed 19 Apr 2017 Aleppo bomb blast kills six Sy...,0
4,29 Syria Rebels Dead in Fighting for Key Alepp...,Sun 10 Jul 2016 29 Syria Rebels Dead in Fighti...,0


In [6]:
df_fake = df_fake[0:500]

In [7]:
X = df_fake.iloc[:, :-1].values  # independent features: article_title and article_content
y = df_fake.iloc[:, -1].values   # dependent feature: labels

In [None]:
X[0]

In [9]:
y[0]

0

In [10]:
# CountVectorizer converts a collection of text documents into a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1000)
# Fit on the article_content column and convert the sparse result to a dense matrix of token counts.
mat_body = cv.fit_transform(X[:, 1]).todense()

With `max_features` set to 1000, `CountVectorizer` keeps only the 1000 most frequent terms across the corpus when building its vocabulary. This caps the dimensionality of the resulting matrix, which is useful for a large corpus. Note that the most frequent terms are not necessarily the most informative ones: common words such as "the" will be kept unless stop words are removed.
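The effect of `max_features` is easier to see on a tiny corpus. The sketch below uses three made-up headlines (hypothetical, not taken from `news.csv`) and a cap of 2 instead of 1000; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from news.csv).
docs = [
    "bomb blast in aleppo",
    "aleppo blast kills six",
    "explosion rocks damascus",
]

# Without a cap, every distinct token becomes a column.
cv_full = CountVectorizer()
full = cv_full.fit_transform(docs)

# With max_features=2, only the two most frequent tokens are kept.
cv_small = CountVectorizer(max_features=2)
small = cv_small.fit_transform(docs)

print(full.shape)                    # (3, 9)
print(small.shape)                   # (3, 2)
print(sorted(cv_small.vocabulary_))  # ['aleppo', 'blast']
```

"aleppo" and "blast" each occur twice and every other token once, so they are the two terms that survive the cap.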


In [11]:
mat_body

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0],
        ...,
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [12]:
cv_head = CountVectorizer(max_features = 1000)
mat_head = cv_head.fit_transform(X[:,0]).todense()

`fit_transform` learns a vocabulary from the article titles and returns a sparse matrix of token counts, in which each row corresponds to a document (here, one title) and each column to a term in the vocabulary.

`todense()` then converts that sparse result into a dense matrix, which is what gets stored in `mat_head`. A dense matrix stores every element explicitly, whereas a sparse matrix stores only the non-zero elements, so the dense form can take considerably more memory.

Overall, this cell tokenizes the titles, counts word occurrences, and produces a dense count matrix for the modeling steps that follow.
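To make the sparse versus dense distinction concrete, the following sketch counts how many entries a sparse count matrix actually stores compared with the number of cells in its dense form. The three documents are hypothetical, and `toarray()` is shown only as an alternative that returns a plain `ndarray`; the notebook itself uses `todense()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from news.csv.
docs = ["aleppo bomb blast", "damascus explosion", "aleppo blast"]

# fit_transform returns a SciPy sparse matrix of token counts.
sparse_counts = CountVectorizer().fit_transform(docs)

dense_matrix = sparse_counts.todense()  # what the notebook uses
dense_array = sparse_counts.toarray()   # plain numpy.ndarray alternative

stored_nonzeros = sparse_counts.nnz     # entries the sparse format stores
total_cells = dense_array.size          # entries the dense format stores

print(dense_array.shape)
print(stored_nonzeros, "non-zero entries out of", total_cells, "cells")
```

On a real corpus with a 1000-term vocabulary the share of non-zero cells is usually far smaller than in this toy case, which is why the sparse form is the default output.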


In [13]:
mat_head

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

In [14]:
X_mat = np.hstack((mat_head, mat_body))

The `np.hstack()` function is used to horizontally stack or concatenate two matrices, `mat_head` and `mat_body`, into a single matrix. This operation combines the columns of the two matrices side by side. Assuming both `mat_head` and `mat_body` are NumPy arrays or matrix-like objects, the resulting matrix `X_mat` will have the same number of rows as the original matrices. The number of columns in `X_mat` will be the sum of the number of columns in `mat_head` and `mat_body`.

The purpose of stacking or concatenating the matrices horizontally is to combine the features or representations derived from different sources (in this case, `mat_head` and `mat_body`). This can be useful in machine learning or data analysis tasks where multiple sets of features need to be considered together for modeling or further analysis.
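A quick shape check illustrates the stacking. The arrays below are hypothetical stand-ins for `mat_head` and `mat_body`, with much smaller dimensions than the real 1000-column matrices.

```python
import numpy as np

# Hypothetical stand-ins: 4 documents, 3 title features, 5 body features.
head = np.ones((4, 3), dtype=int)
body = np.zeros((4, 5), dtype=int)

# Columns are placed side by side; the row count must match.
combined = np.hstack((head, body))

print(combined.shape)  # (4, 8)
print(combined[0])     # first 3 values come from head, last 5 from body
```

In the notebook the same operation yields one row per article, with the title features in the first block of columns and the body features in the second.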


In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_mat,y,test_size=0.2, random_state=0)

The `train_test_split` function is commonly used to split a dataset into training and testing sets for machine learning tasks. It randomly divides the data into two parts based on the specified `test_size` parameter, which determines the proportion of the data to be allocated for testing.

The inputs to the `train_test_split` function are `X_mat` and `y`, representing the feature matrix and target variable, respectively.

The outputs of the function are assigned to four variables: `X_train`, `X_test`, `y_train`, and `y_test`. The `X_train` and `y_train` variables will contain the training data, while `X_test` and `y_test` will hold the testing data.

By specifying `test_size=0.2`, 20% of the data will be allocated for testing, while the remaining 80% will be used for training. The `random_state` parameter is set to 0, which ensures that the random shuffling and splitting of the data will be reproducible.
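The split sizes and the reproducibility claim can be checked on synthetic data. This is a minimal sketch; `X_demo` and `y_demo` are made-up arrays with the same number of rows (500) as the truncated dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 500 samples with 2 features each (hypothetical).
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.arange(500) % 2

# Two independent calls with the same random_state.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(a_train.shape, a_test.shape)     # (400, 2) (100, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```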


In [20]:
from sklearn.tree import DecisionTreeClassifier

The cells below create an instance of `DecisionTreeClassifier` named `dtc` with `criterion` set to 'entropy', so the quality of each split is measured by information gain.

Next, `X_train` and `y_train` are passed through `np.asarray()`, which converts them to plain NumPy arrays (the `todense()` calls earlier produced `np.matrix` objects), and then through `np.squeeze()`, which removes any axes of length one. `X_train` remains two-dimensional, with one row per document, and `y_train` remains one-dimensional, which is the format `fit()` expects.

`X_test` is converted in the same way.

Finally, `fit()` is called on `dtc` with `X_train` and `y_train`, which trains the decision tree classifier on the training set.
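As a small check of what `np.asarray()` and `np.squeeze()` do to shapes, the sketch below applies them to a tiny hypothetical `np.matrix` and to a column vector. Note that a matrix with more than one row and column keeps its two dimensions; only length-one axes are dropped.

```python
import numpy as np

# A small np.matrix, similar in type to what todense() typically returns.
m = np.matrix([[0, 1, 0], [2, 0, 0]])

arr = np.asarray(m)                   # plain ndarray, same data
print(type(arr).__name__, arr.shape)  # ndarray (2, 3)

# squeeze drops only axes of length one, so a 2x3 array is unchanged.
print(np.squeeze(arr).shape)          # (2, 3)

# A column vector does have a length-one axis, so it becomes 1-D.
col = np.array([[0], [1], [0]])
print(np.squeeze(col).shape)          # (3,)
```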


In [None]:
dtc = DecisionTreeClassifier(criterion='entropy')

With `criterion='entropy'`, the classifier uses entropy as its impurity measure when building the tree. At each node it chooses the split that maximizes information gain, that is, the split that most reduces the weighted entropy of the resulting subsets.
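Entropy and information gain can be worked out by hand for a single candidate split. The sketch below uses hypothetical class counts and a helper function `entropy` written for illustration; it is not scikit-learn's internal implementation.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical node with 5 samples of each class, and one candidate split.
parent = [5, 5]
left, right = [4, 1], [1, 4]

n = sum(parent)
# Child entropies are weighted by the share of samples each child receives.
weighted_child = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
gain = entropy(parent) - weighted_child

print(round(entropy(parent), 3))  # 1.0
print(round(gain, 3))             # 0.278
```

A split that separated the classes perfectly would leave both children with zero entropy and give the maximum possible gain of 1.0 bit for this parent.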


In [None]:
X_train = np.squeeze(np.asarray(X_train))
y_train = np.squeeze(np.asarray(y_train))
X_test = np.squeeze(np.asarray(X_test))
dtc.fit(X_train, y_train)

In [23]:
y_pred = dtc.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

The `confusion_matrix` function is commonly used in **classification tasks** to evaluate the performance of a machine learning model by comparing the predicted labels (`y_pred`) with the true labels (`y_test`).

The function takes two arguments: `y_test` and `y_pred`. `y_test` represents the true labels of the test data, while `y_pred` represents the predicted labels generated by the trained model.

When executed, the `confusion_matrix` function will compute a confusion matrix based on the provided labels. The confusion matrix is a table that summarizes the performance of a classification model, showing the counts of true positive, true negative, false positive, and false negative predictions.

The output of `confusion_matrix` is a 2-dimensional array. The rows correspond to the true labels and the columns to the predicted labels, so each element counts the data points with a particular combination of true and predicted label. For binary labels 0 and 1, the layout is `[[TN, FP], [FN, TP]]`, with label 1 treated as the positive class.

By examining the confusion matrix, you can gain insights into the model's performance, including the accuracy, precision, recall, and other evaluation metrics derived from the counts in the matrix.


In [28]:
print(cm)

[[23 24]
 [21 32]]
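
The usual summary metrics can be derived directly from the matrix printed above. This sketch hard-codes those four counts and assumes label 1 is the positive class; the notebook does not say which label marks which kind of article, so treat that as an assumption.

```python
import numpy as np

# Counts copied from the printed confusion matrix.
cm = np.array([[23, 24],
               [21, 32]])

# scikit-learn layout for binary labels: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # assumes label 1 is the positive class
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.55 precision=0.57 recall=0.60
```

An accuracy of 0.55 on 100 test articles is only slightly better than guessing on a roughly balanced test set, so the tree is learning little from these count features.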
