## **Notebook Contents**

- [Import Libraries](#importlibraries)  
- [Import Dataframes](#importdataframes)
- [Merge the Data](#mergedata)  
- [Word Cleaning](#wordcleaning)
- [Word EDA](#wordeda)
- [Train/Test Split](#train/test/split)   
- [Simple Logistic Regression](#simplelogreg) 
- [Gridsearched Count Vectorizer for Logistic Regression and Naive Bayes](#grcvlrnb)  
- [Confusion Matrix](#cm)  
- [Scores](#scores)

<a name="importlibraries"></a>
## **Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, GridSearchCV
from bs4 import BeautifulSoup       
from nltk.corpus import stopwords
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB



<a name="importdataframes"></a>
## **Import Dataframes**

In [2]:
data_ai = pd.read_csv('./data/data_ai.csv')
data_ml = pd.read_csv('./data/data_ml.csv')

In [3]:
data_ai.head(1)

Unnamed: 0,subreddit,title,selftext
0,artificial,Could AI ethics draw on non-Western philosophi...,


In [4]:
data_ml.head(1)

Unnamed: 0,subreddit,title,selftext
0,MachineLearning,[R] Taming pretrained transformers for eXtreme...,New X-Transformer model from Amazon Research\n...


In [5]:
print(f'The shape of the AI dataframe is {data_ai.shape}')
print(f'The shape of the ML dataframe is {data_ml.shape}')

The shape of the AI dataframe is (31299, 3)
The shape of the ML dataframe is (31299, 3)


<a name="mergedata"></a>
## **Merge the Data**

In [6]:
df = data_ai.append(data_ml).reset_index()

In [7]:
df.drop(columns='index',inplace=True)

In [8]:
print(f'The shape of the merged AI and ML dataframes are {df.shape}')

The shape of the merged AI and ML dataframes are (62598, 3)


In [9]:
df.isnull().sum()

subreddit        0
title            0
selftext     31046
dtype: int64

**Let's see what a title might look like:**

In [10]:
df['title'][0]

'Could AI ethics draw on non-Western philosophies to help reframe AI ethics'

**Let's see what a selftext might look like:**

In [11]:
df['selftext'][34]

'I need to learn AI to make a project. Because of corona virus, our internal exams are probably not happening, so our teacher has decided to give high amount of marks to AI project.\n\nHow can I learn AI to build projects? I am saying university explicitly, because we have lots of other tasks to do in university as well, so please keep that in mind while suggesting me anything.'

<a name="wordcleaning"></a>
## **Word Cleaning**

In [12]:
# A lot of NaN values
df.isna().sum()

subreddit        0
title            0
selftext     31046
dtype: int64

In [13]:
# Replace the NaN values with ''
df['selftext'].replace(np.nan, '', inplace=True)

In [14]:
df.isna().sum()

subreddit    0
title        0
selftext     0
dtype: int64

In [17]:
df

Unnamed: 0,subreddit,title,selftext
0,artificial,Could AI ethics draw on non-Western philosophi...,
1,artificial,Realistic simulation of tearing meat and peeli...,
2,artificial,[R] Using Deep RL to Model Human Locomotion Co...,In the new paper [*Deep Reinforcement Learning...
3,artificial,Artificial Intelligence Easily Beats Human Fig...,
4,artificial,Foiling illicit cryptocurrency mining with art...,
...,...,...,...
62593,MachineLearning,What are some things that you wish you knew be...,[removed]
62594,MachineLearning,[D] Does anyone created a formal database for ...,I'm looking for a database that has sufficient...
62595,MachineLearning,"[P] Demo of ""Arbitrary Style Transfer with Sty...",Hi MachineLearning\n\nI'll introduce awsome st...
62596,MachineLearning,[R] Triplet loss for image retrieval,"Hi, there!\n\n \nThis is an example of image ..."


In [18]:
# Let's replace the [removed] and [deleted] with ''
df['selftext'].replace('[removed]', '', inplace = True)
df['selftext'].replace('[deleted]', '', inplace = True)

In [19]:
# Check the balance of the classes these are unbalanced
df['subreddit'].value_counts()

MachineLearning    31299
artificial         31299
Name: subreddit, dtype: int64

In [20]:
df

Unnamed: 0,subreddit,title,selftext
0,artificial,Could AI ethics draw on non-Western philosophi...,
1,artificial,Realistic simulation of tearing meat and peeli...,
2,artificial,[R] Using Deep RL to Model Human Locomotion Co...,In the new paper [*Deep Reinforcement Learning...
3,artificial,Artificial Intelligence Easily Beats Human Fig...,
4,artificial,Foiling illicit cryptocurrency mining with art...,
...,...,...,...
62593,MachineLearning,What are some things that you wish you knew be...,
62594,MachineLearning,[D] Does anyone created a formal database for ...,I'm looking for a database that has sufficient...
62595,MachineLearning,"[P] Demo of ""Arbitrary Style Transfer with Sty...",Hi MachineLearning\n\nI'll introduce awsome st...
62596,MachineLearning,[R] Triplet loss for image retrieval,"Hi, there!\n\n \nThis is an example of image ..."


In [21]:
# TEXT CLEANING FUNCTION FOR EVERY POST IN BOTH SUBREDDITS TITLE + SELFTEXT

# These will be replaced by a space ' ' 
symbol_replace_space = re.compile('[/(){}\[\]\|@,;]')

# We will get rid of all these in the function below
bad_symbols = re.compile('[^0-9a-z #+_]')

# We will get rid of all of the stopwords
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):

    # Make all of the text lower case
    text = text.lower() 

    # Replace symbol_replace_space with a space 
    text = symbol_replace_space.sub(' ', text)
    
    # Replace bad_symbols with a space
    text = bad_symbols.sub('', text) 
    
    # This gets rid of the integers
    text = re.sub(r'\d+', '', text) 

    # remove stopwords from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) 

    return text

# Applying the clean_text function above to every title in df['title']
df['title'] = df['title'].apply(clean_text)
df['selftext'] = df['selftext'].apply(clean_text)

In [22]:
df

Unnamed: 0,subreddit,title,selftext
0,artificial,could ai ethics draw nonwestern philosophies h...,
1,artificial,realistic simulation tearing meat peeling chee...,
2,artificial,r using deep rl model human locomotion control...,new paper deep reinforcement learning modeling...
3,artificial,artificial intelligence easily beats human fig...,
4,artificial,foiling illicit cryptocurrency mining artifici...,
...,...,...,...
62593,MachineLearning,things wish knew starting domain machine learning,
62594,MachineLearning,anyone created formal database word meaning,im looking database sufficient information mat...
62595,MachineLearning,p demo arbitrary style transfer styleattention...,hi machinelearningill introduce awsome style t...
62596,MachineLearning,r triplet loss image retrieval,hi example image retrieval based mnist fashion...


**We need to concat the title and selftext so we can put it into one columns so we can count vectorize**
- https://stackoverflow.com/questions/34710281/use-featureunion-in-scikit-learn-to-combine-two-pandas-columns-for-tfidf

In [23]:
# This is how we created the new column in our dataframe
df['title_selftext'] = df['title'] + ' ' + df['selftext']

In [24]:
df.head()

Unnamed: 0,subreddit,title,selftext,title_selftext
0,artificial,could ai ethics draw nonwestern philosophies h...,,could ai ethics draw nonwestern philosophies h...
1,artificial,realistic simulation tearing meat peeling chee...,,realistic simulation tearing meat peeling chee...
2,artificial,r using deep rl model human locomotion control...,new paper deep reinforcement learning modeling...,r using deep rl model human locomotion control...
3,artificial,artificial intelligence easily beats human fig...,,artificial intelligence easily beats human fig...
4,artificial,foiling illicit cryptocurrency mining artifici...,,foiling illicit cryptocurrency mining artifici...


## **Train/Test Split**

In [29]:
X = df[['title_selftext']]
y = df['subreddit']

In [30]:
X.shape

(62598, 1)

In [31]:
y.shape

(62598,)

In [33]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state=42)

In [34]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(46948, 1)
(15650, 1)
(46948,)
(15650,)


In [35]:
y_train.value_counts()

MachineLearning    23474
artificial         23474
Name: subreddit, dtype: int64

In [36]:
y_test.value_counts()

MachineLearning    7825
artificial         7825
Name: subreddit, dtype: int64

## **Count Vectorizer**

In [39]:
# Instantiate the "CountVectorizer" object, which is scikit-learn's bag of words tool
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 20000,
                             min_df = 2
                             )

In [40]:
# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.

train_data_features = vectorizer.fit_transform(X_train['title_selftext'])

test_data_features = vectorizer.transform(X_test['title_selftext'])

# Numpy arrays are easy to work with, so convert the result to an array.
train_data_features = train_data_features.toarray()

In [41]:
train_data_features

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## **Simple Logistic Regression**

In [42]:
# Instantiate the Logistic Regression
lr = LogisticRegression(solver='liblinear')

In [43]:
lr.fit(train_data_features, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [44]:
lr.score(train_data_features, y_train)

0.8884723523898782

In [45]:
lr.score(test_data_features, y_test)

0.8046645367412141

In [46]:
print(f"LogisticRegression with Gridsearch training accuracy is: {lr.score(train_data_features, y_train)}")
print(f"LogisticRegression with Gridsearch testing accuracy is: {lr.score(test_data_features, y_test)}")

LogisticRegression with Gridsearch training accuracy is: 0.8884723523898782
LogisticRegression with Gridsearch testing accuracy is: 0.8046645367412141


## **Gridsearched Count Vectorizer for Logistic Regression and Naive Bayes**

In [47]:
# Pipeline for CountVectorizer for Naive Bayes and Logisitc Regression
pipe_cvec_lr = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))
])

pipe_cvec_nb = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())    
])

In [48]:
pipe_params = {
    'cvec__max_features': [5000, 10_000, 15_000, 20_000, 25_000],
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1, 1), (1,2)],
#     'cvec__max_df': [.90, .95]
    }

In [49]:
# Instantiate GridSearchCV.

#LR
gs_cvec_lr = GridSearchCV(pipe_cvec_lr, # what object are we optimizing?
                  param_grid = pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

#NB
gs_cvec_nb = GridSearchCV(pipe_cvec_nb, # what object are we optimizing?
                  param_grid = pipe_params, # what parameters values are we searching?
                  cv = 5) # 5-fold cross-validation.

In [50]:
gs_cvec_lr.fit(X_train['title_selftext'], y_train)



GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [51]:
gs_cvec_nb.fit(X_test['title_selftext'], y_test)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [53]:
gs_cvec_lr.best_params_

{'cvec__max_features': 25000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}

In [54]:
gs_cvec_nb.best_params_

{'cvec__max_features': 25000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}

In [52]:
# Score model train & test for Logisitic Regression
print(f"LogisticRegression with Gridsearch training accuracy is: {gs_cvec_lr.score(X_train['title_selftext'], y_train)}")
print(f"LogisticRegression with Gridsearch testing accuracy is: {gs_cvec_lr.score(X_test['title_selftext'], y_test)}")

LogisticRegression with Gridsearch training accuracy is: 0.9026795603646588
LogisticRegression with Gridsearch testing accuracy is: 0.8090095846645368


In [335]:
# Score model on train & test for Naive Bayes
print(f"Naive Bayes with Gridsearch training accuracy is: {gs_cvec_nb.score(X_train['title_selftext'], y_train)}")
print(f"Naive Bayes with Gridsearch training accuracy is: {gs_cvec_nb.score(X_test['title_selftext'], y_test)}")

Naive Bayes with Gridsearch training accuracy is: 0.7982451828548958
Naive Bayes with Gridsearch training accuracy is: 0.810588772250114


## **Scores**

Logistic Regression with Gridsearch training accuracy is: 0.8884723523898782  
Logistic Regression with Gridsearch testing accuracy is: 0.8046645367412141  

Logistic Regression with Gridsearch training accuracy is: 0.9026795603646588  
Logistic Regression with Gridsearch testing accuracy is: 0.8090095846645368   

Naive Bayes with Gridsearch training accuracy is: 0.7982451828548958  
Naive Bayes with Gridsearch training accuracy is: 0.810588772250114  