### Model Exercises for NLP

Take the work we did in the lessons further:
- What other types of models (i.e. different classification algorithms) could you use?
- How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [1]:
# imports 
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import unicodedata
import re
import acquire, prepare
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score

## Can we build an NLP model that predicts the news category based on the text of the article?

In [2]:
# acquire the dataframe of news articles 
news = acquire.acquire_news_articles()
news.head()

Unnamed: 0,title,content,category
0,"Godrej, PwC, Deloitte India give extra offs to...",Several companies in India have been offering ...,business
1,"Bill Gates' company Cascade transfers ₹13,300 ...","Bill Gates' Cascade Investment, a holding comp...",business
2,RIL may soon fly in Israeli experts to install...,Reliance Industries has sought permission to f...,business
3,Second COVID-19 wave hit India like a tsunami:...,Biocon Founder Kiran Mazumdar-Shaw said that t...,business
4,China flight halt in India may hurt pharma sup...,The Indian Drug Manufacturers' Association (ID...,business


In [3]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     149 non-null    object
 1   content   149 non-null    object
 2   category  149 non-null    object
dtypes: object(3)
memory usage: 3.6+ KB


In [4]:
news.category.value_counts()

business         25
world            25
technology       25
sports           25
entertainment    25
science          24
Name: category, dtype: int64

Takeaways:
- 149 articles
- six news categories
- nearly balanced data set (24 to 25 articles per category)
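
Because the categories are almost evenly sized, a baseline that always predicts the most common category would be right only about one time in six. That gives a reference point for the accuracies reported later. A quick check using the counts shown above:

```python
import pandas as pd

# Category counts copied from the value_counts() output above
counts = pd.Series({
    "business": 25, "world": 25, "technology": 25,
    "sports": 25, "entertainment": 25, "science": 24,
})

# Accuracy of always guessing the most frequent category
baseline_accuracy = counts.max() / counts.sum()
print(round(baseline_accuracy, 2))  # 0.17
```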

In [5]:
def clean(text):
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') 
    text = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    # Return a joined string
    return " ".join([wnl.lemmatize(word) for word in words if word not in stopwords])
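
The normalization and punctuation steps inside `clean` can be tried on their own, without the NLTK downloads. This sketch leaves out stopword removal and lemmatization, and the sample sentence is made up for illustration:

```python
import re
import unicodedata

sample = "Kiran Mazumdar-Shaw's café costs ₹13,300!"

# Decompose accented characters, then drop anything that is not ASCII
ascii_text = (unicodedata.normalize("NFKD", sample)
              .encode("ascii", "ignore")
              .decode("utf-8", "ignore")
              .lower())

# Remove punctuation (anything that is not a word character or whitespace)
words = re.sub(r"[^\w\s]", "", ascii_text).split()
print(words)
# ['kiran', 'mazumdarshaws', 'cafe', 'costs', '13300']
```

Note that removing punctuation fuses hyphenated names and strips digit separators, which is why the cleaned articles contain tokens such as `mazumdarshaw`.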



In [6]:
# prepare the data by applying the clean function
news.content = news.content.apply(clean)
news.head()

Unnamed: 0,title,content,category
0,"Godrej, PwC, Deloitte India give extra offs to...",several company india offering extra holiday e...,business
1,"Bill Gates' company Cascade transfers ₹13,300 ...",bill gate cascade investment holding company g...,business
2,RIL may soon fly in Israeli experts to install...,reliance industry sought permission fly israel...,business
3,Second COVID-19 wave hit India like a tsunami:...,biocon founder kiran mazumdarshaw said second ...,business
4,China flight halt in India may hurt pharma sup...,indian drug manufacturer association idma warn...,business


In [7]:
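# function to split the data
def split(df, stratify_by=None):
    """
    Three-way split into train, validate, and test datasets.
    To stratify, pass a column name as stratify_by.
    """
    # Only stratify when a column name is given (df[None] would raise a KeyError)
    test_strat = df[stratify_by] if stratify_by else None
    train, test = train_test_split(df, test_size=.2, random_state=123, stratify=test_strat)

    train_strat = train[stratify_by] if stratify_by else None
    train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train_strat)

    return train, validate, test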
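
With `test_size=.2` followed by `test_size=.3`, the split works out to roughly 56% train, 24% validate, and 20% test. The sketch below repeats the two calls on a stand-in frame (the `df` here is synthetic; only the category counts are taken from the notebook), and the resulting sizes should match the shapes the notebook reports.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same category counts as the news data
counts = {"business": 25, "world": 25, "technology": 25,
          "sports": 25, "entertainment": 25, "science": 24}
df = pd.DataFrame({"category": [c for c, n in counts.items() for _ in range(n)]})

# Same two-step split as split(): 20% test, then 30% of the rest for validate
train_val, test_part = train_test_split(
    df, test_size=.2, random_state=123, stratify=df["category"])
train_part, validate_part = train_test_split(
    train_val, test_size=.3, random_state=123, stratify=train_val["category"])

print(len(train_part), len(validate_part), len(test_part))  # 83 36 30
```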

In [8]:
# Split the data for modeling
train, validate, test = split(news, 'category')
train.head()

Unnamed: 0,title,content,category
137,"Trump reportedly creates new Twitter account, ...",twitter suspended account called djtdesk repor...,world
83,My brother will watch me from above: Nikki on ...,former bigg bos 14 contestant nikki tamboli lo...,entertainment
43,Jose Mourinho named Italian club Roma's head c...,jose mourinho appointed italian club roma head...,sports
13,Pay your fair share of taxes: US lawmaker Jaya...,u lawmaker pramila jayapal responded tesla ceo...,business
51,Jeff Bezos sells $2.5 bn of Amazon shares in h...,amazon founder outgoing ceo jeff bezos sold 25...,technology


In [9]:
# setup X-train, validate and test variables
X_train = train.content
X_validate = validate.content
X_test = test.content

# setup y-train, validate and test variables
y_train = train.category
y_validate = validate.category
y_test = test.category

In [10]:
X_train.shape, X_validate.shape, X_test.shape

((83,), (36,), (30,))

In [11]:
y_train.shape, y_validate.shape, y_test.shape

((83,), (36,), (30,))

In [12]:
# Create the TF-IDF vectorizer object
# Step 1: compute a TF-IDF value for each word in each document
# Step 2: encode these values as numbers so that we can use models that only work on numeric input, such as classification models
tfidf = TfidfVectorizer()

# Fit on the training data
tfidf.fit(X_train)

# Use the object
X_train_vectorized = tfidf.transform(X_train)
X_validate_vectorized = tfidf.transform(X_validate)
X_test_vectorized = tfidf.transform(X_test)
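
Fitting on the training text only means the vocabulary and the IDF weights come from train alone; words that appear only in validate or test are ignored at transform time. A small illustration on made-up documents (`vectorizer`, `train_docs`, and `unseen_docs` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
train_docs = ["market share rise", "team win match"]
unseen_docs = ["rocket launch", "market rocket"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

train_matrix = vectorizer.transform(train_docs)
unseen_matrix = vectorizer.transform(unseen_docs)

# One column per word seen during fit
print(train_matrix.shape)  # (2, 6)

# "rocket" and "launch" were never seen, so the first row is all zeros;
# the second row keeps only the known word "market"
print(unseen_matrix.toarray().round(2))
```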

In [13]:
# Now that we have a vectorized dataset, we can use our classification tools!
lm = LogisticRegression()

# Fit the classification model on our vectorized train data
lm.fit(X_train_vectorized, y_train)

LogisticRegression()

In [14]:
# create dataframes of actual values
# (note: this reuses the names train, validate, and test, replacing the split dataframes)
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))
test = pd.DataFrame(dict(actual=y_test))

In [15]:
# Use the trained model to predict y given those vectorized inputs of X
train['predicted'] = lm.predict(X_train_vectorized)
validate["predicted"] = lm.predict(X_validate_vectorized)
test['predicted'] = lm.predict(X_test_vectorized)

In [16]:
# Train Accuracy
train_accuracy = round((train.actual == train.predicted).mean(),2)
train_accuracy

0.92

In [17]:
train.head()

Unnamed: 0,actual,predicted
137,world,technology
83,entertainment,entertainment
43,sports,sports
13,business,business
51,technology,technology


In [18]:
# Out of sample accuracy
validate_accuracy = round((validate.actual == validate.predicted).mean(), 2)
validate_accuracy 

0.61

In [19]:
from sklearn.metrics import classification_report
print(classification_report(train.actual, train.predicted))

               precision    recall  f1-score   support

     business       0.88      1.00      0.93        14
entertainment       0.93      1.00      0.97        14
      science       1.00      0.77      0.87        13
       sports       1.00      0.93      0.96        14
   technology       0.93      0.93      0.93        14
        world       0.80      0.86      0.83        14

     accuracy                           0.92        83
    macro avg       0.92      0.91      0.91        83
 weighted avg       0.92      0.92      0.92        83



In [20]:
#begin building a dataframe to record accuracy
metric_df = pd.DataFrame(data=[{
    'model': 'logistic regression', 
    'train_accuracy': round(train_accuracy, 2),
    'validate_accuracy': validate_accuracy}])
metric_df

Unnamed: 0,model,train_accuracy,validate_accuracy
0,logistic regression,0.92,0.61


In [21]:
from sklearn.metrics import classification_report
print(classification_report(validate.actual, validate.predicted))

               precision    recall  f1-score   support

     business       0.71      0.83      0.77         6
entertainment       0.50      0.67      0.57         6
      science       1.00      0.50      0.67         6
       sports       0.57      0.67      0.62         6
   technology       0.50      0.50      0.50         6
        world       0.60      0.50      0.55         6

     accuracy                           0.61        36
    macro avg       0.65      0.61      0.61        36
 weighted avg       0.65      0.61      0.61        36



In [22]:
# Test Accuracy
(test.actual == test.predicted).mean()

0.6333333333333333

In [24]:
test.head()

Unnamed: 0,actual,predicted
82,entertainment,entertainment
84,entertainment,entertainment
19,business,entertainment
90,entertainment,entertainment
130,world,business


In [25]:
precision_score(test.actual, test.predicted, average='weighted')

0.662037037037037
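
With `average='weighted'`, the score is the mean of the per-class precisions, weighted by each class's support. As a rough check, the rounded per-class values from the test-set classification report (printed at the end of this notebook) give nearly the same number:

```python
import numpy as np

# Per-class precision and support, copied (rounded) from the test-set report
precision = np.array([0.50, 0.56, 0.75, 1.00, 0.50, 0.67])
support = np.array([5, 5, 5, 5, 5, 5])

# Support-weighted mean of the per-class precisions
weighted_precision = np.average(precision, weights=support)
print(round(weighted_precision, 2))  # 0.66
```

Because every class has the same support in the test set, the weighted and macro averages coincide here.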

### Random Forest Classifier

In [26]:
y_train

137            world
83     entertainment
43            sports
13          business
51        technology
           ...      
52        technology
94     entertainment
47            sports
8           business
75     entertainment
Name: category, Length: 83, dtype: object

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix


#Create the model
rf = RandomForestClassifier(bootstrap=True, 
                            class_weight=None, 
                            criterion='gini',
                            min_samples_leaf=3,
                            n_estimators=100,
                            max_depth=3, 
                            random_state=123)

In [28]:
# Rebuild the TF-IDF features (same steps as for logistic regression above)
# Step 1: compute a TF-IDF value for each word in each document
# Step 2: encode these values as numbers so that we can use models that only work on numeric input, such as classification models
tfidf = TfidfVectorizer()

# Fit on the training data
tfidf.fit(X_train)

# Use the object
X_train_vectorized = tfidf.transform(X_train)
X_validate_vectorized = tfidf.transform(X_validate)
X_test_vectorized = tfidf.transform(X_test)

In [29]:
#Fit the model
rf.fit(X_train_vectorized, y_train)

RandomForestClassifier(max_depth=3, min_samples_leaf=3, random_state=123)

In [30]:
#Check feature importance
print(rf.feature_importances_)

[0.         0.00488342 0.         ... 0.         0.         0.        ]


In [31]:
#Make predictions
y_train_pred = rf.predict(X_train_vectorized)
y_validate_pred = rf.predict(X_validate_vectorized)

In [32]:
#Estimate the probability
y_train_pred_proba = rf.predict_proba(X_train_vectorized)
y_validate_pred_proba = rf.predict_proba(X_validate_vectorized)

In [33]:
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf.score(X_train_vectorized, y_train)))

Accuracy of random forest classifier on training set: 0.86


In [34]:
rf_train_accuracy = round(rf.score(X_train_vectorized, y_train),2)

In [35]:
print(confusion_matrix(y_train, y_train_pred))

[[12  0  0  0  0  2]
 [ 0 13  0  1  0  0]
 [ 0  1 10  0  0  2]
 [ 0  0  0 13  0  1]
 [ 1  0  1  0 12  0]
 [ 1  0  1  0  1 11]]
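
The matrix above has no labels: rows are the actual categories and columns the predicted ones, in the sorted order of the class labels (the same order as in the classification report). Wrapping the result in a DataFrame makes that explicit. The labels below are made up for illustration:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = ["business", "business", "sports", "sports", "world", "world"]
y_pred = ["business", "world", "sports", "sports", "world", "business"]
labels = ["business", "sports", "world"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(cm,
                     index=[f"actual_{name}" for name in labels],
                     columns=[f"pred_{name}" for name in labels])
print(cm_df)
```

In the notebook, `rf.classes_` could supply the label order.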


In [36]:
print(classification_report(y_train, y_train_pred))

               precision    recall  f1-score   support

     business       0.86      0.86      0.86        14
entertainment       0.93      0.93      0.93        14
      science       0.83      0.77      0.80        13
       sports       0.93      0.93      0.93        14
   technology       0.92      0.86      0.89        14
        world       0.69      0.79      0.73        14

     accuracy                           0.86        83
    macro avg       0.86      0.85      0.86        83
 weighted avg       0.86      0.86      0.86        83



In [37]:
# Check accuracy on validate
# (lower than logistic regression's validate accuracy, so this model will not be run on test)
print('Accuracy of random forest classifier on validate set: {:.2f}'
     .format(rf.score(X_validate_vectorized, y_validate)))

Accuracy of random forest classifier on validate set: 0.42


In [38]:
rf_validate_accuracy = round(rf.score(X_validate_vectorized, y_validate),2)

In [39]:
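# add a row to the dataframe to compare accuracy
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': 'random_forest', 
    'train_accuracy': rf_train_accuracy,
    'validate_accuracy': rf_validate_accuracy}])], ignore_index=True)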
metric_df


Unnamed: 0,model,train_accuracy,validate_accuracy
0,logistic regression,0.92,0.61
1,random_forest,0.86,0.42


### K-Nearest Neighbor

In [40]:
#imports 
from sklearn.neighbors import KNeighborsClassifier

In [41]:
# Create the object
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

In [42]:
#Fit the model
knn.fit(X_train_vectorized, y_train)

KNeighborsClassifier()
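
`KNeighborsClassifier` uses Euclidean distance by default. `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals twice the cosine distance, so neighbors are ranked the same way they would be under cosine distance. A quick numerical check on made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = ["stock market rally", "market share profit", "team win match"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Each TF-IDF row has unit length under the default norm='l2'
norms = np.linalg.norm(X, axis=1)
print(norms)

a, b = X[0], X[1]
squared_euclidean = np.sum((a - b) ** 2)
cosine_distance = 1 - a @ b

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b))
print(np.isclose(squared_euclidean, 2 * cosine_distance))  # True
```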

In [43]:
#Make predictions
y_train_pred = knn.predict(X_train_vectorized)

In [44]:
#Estimate probability
y_train_pred_proba = knn.predict_proba(X_train_vectorized)

In [45]:
#Evaluate on accuracy
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn.score(X_train_vectorized, y_train)))

Accuracy of KNN classifier on training set: 0.65


In [46]:
knn_train_accuracy = knn.score(X_train_vectorized, y_train)
knn_train_accuracy

0.6506024096385542

In [47]:
#Make predictions
y_validate_pred = knn.predict(X_validate_vectorized)

In [48]:
#Estimate probability
y_validate_pred_proba = knn.predict_proba(X_validate_vectorized)

In [49]:
#Evaluate on accuracy
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn.score(X_validate_vectorized, y_validate)))

Accuracy of KNN classifier on validate set: 0.47


In [50]:
knn_validate_accuracy = round(knn.score(X_validate_vectorized, y_validate),2)
knn_validate_accuracy

0.47

In [51]:
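# add a row to the dataframe to compare accuracy
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
metric_df = pd.concat([metric_df, pd.DataFrame([{
    'model': 'K-Nearest Neighbor', 
    'train_accuracy': round(knn_train_accuracy,2),
    'validate_accuracy': knn_validate_accuracy}])], ignore_index=True)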
metric_df


Unnamed: 0,model,train_accuracy,validate_accuracy
0,logistic regression,0.92,0.61
1,random_forest,0.86,0.42
2,K-Nearest Neighbor,0.65,0.47


### Takeaways
Logistic regression has the highest accuracy of the three models on the out-of-sample validate data (0.61, versus 0.47 for K-nearest neighbors and 0.42 for random forest), so it is the model to evaluate on the test data set.
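
The selection can also be read off the comparison table programmatically. This sketch rebuilds the table under a separate, illustrative name (`scores`) from the values reported above, picks the model with the highest validate accuracy, and adds the train-validate gap as a rough sign of overfitting:

```python
import pandas as pd

# Accuracy values copied from the comparison table above
scores = pd.DataFrame({
    "model": ["logistic regression", "random_forest", "K-Nearest Neighbor"],
    "train_accuracy": [0.92, 0.86, 0.65],
    "validate_accuracy": [0.61, 0.42, 0.47],
})

# Pick the row with the highest validate accuracy
best = scores.loc[scores["validate_accuracy"].idxmax()]
print(best["model"])  # logistic regression

# Gap between train and validate accuracy for each model
scores["gap"] = (scores["train_accuracy"] - scores["validate_accuracy"]).round(2)
print(scores.sort_values("validate_accuracy", ascending=False))
```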

In [52]:
from sklearn.metrics import classification_report
print(classification_report(test.actual, test.predicted))

               precision    recall  f1-score   support

     business       0.50      0.40      0.44         5
entertainment       0.56      1.00      0.71         5
      science       0.75      0.60      0.67         5
       sports       1.00      0.80      0.89         5
   technology       0.50      0.60      0.55         5
        world       0.67      0.40      0.50         5

     accuracy                           0.63        30
    macro avg       0.66      0.63      0.63        30
 weighted avg       0.66      0.63      0.63        30



### On the test set, logistic regression predicts with 63% accuracy, 66% precision, and 63% recall (weighted averages)