## **Notebook Contents**

- [Import Libraries](#importlibraries)  
- [Import Dataframes](#importdataframes)
- [Merge the Data](#mergedata)  
- [Word Cleaning](#wordcleaning)
- [Word EDA](#wordeda)
- [Train/Test Split](#train/test/split)   
- [Simple Logistic Regression](#simplelogreg) 
- [Gridsearched Count Vectorizer for Logistic Regression and Naive Bayes](#grcvlrnb)  
- [Confusion Matrix](#cm)  
- [Scores](#scores)

<a name="importlibraries"></a>
## **Import Libraries**

In [1]:
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split, GridSearchCV
from bs4 import BeautifulSoup       
from nltk.corpus import stopwords
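# Note: plot_confusion_matrix was removed in scikit-learn 1.2; newer versions use ConfusionMatrixDisplay.from_estimator
from sklearn.metrics import confusion_matrix, plot_confusion_matrix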
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

<a name="importdataframes"></a>
## **Import Dataframes**

In [2]:
data_ai = pd.read_csv('../project_3-master/data/data_ai.csv')
data_ml = pd.read_csv('../project_3-master/data/data_ml.csv')

In [3]:
data_ai.head(1)

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | artificial | Could AI ethics draw on non-Western philosophi... | |


In [4]:
data_ml.head(1)

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | MachineLearning | [R] Taming pretrained transformers for eXtreme... | New X-Transformer model from Amazon Research\n... |


<a name="mergedata"></a>
## **Merge the Data**

In [57]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([data_ai, data_ml]).reset_index()

In [58]:
df.drop(columns='index',inplace=True)

In [59]:
df

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | artificial | Could AI ethics draw on non-Western philosophi... | |
| 1 | artificial | Realistic simulation of tearing meat and peeli... | |
| 2 | artificial | [R] Using Deep RL to Model Human Locomotion Co... | In the new paper [*Deep Reinforcement Learning... |
| 3 | artificial | Artificial Intelligence Easily Beats Human Fig... | |
| 4 | artificial | Foiling illicit cryptocurrency mining with art... | |
| ... | ... | ... | ... |
| 62593 | MachineLearning | What are some things that you wish you knew be... | [removed] |
| 62594 | MachineLearning | [D] Does anyone created a formal database for ... | I'm looking for a database that has sufficient... |
| 62595 | MachineLearning | [P] Demo of "Arbitrary Style Transfer with Sty... | Hi MachineLearning\n\nI'll introduce awsome st... |
| 62596 | MachineLearning | [R] Triplet loss for image retrieval | Hi, there!\n\n \nThis is an example of image ... |
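
The merge, index reset, and column drop above can also be done in a single call. The following is a minimal sketch, assuming two toy frames (`toy_ai` and `toy_ml`, hypothetical stand-ins for `data_ai` and `data_ml`) rather than the real data:

```python
import pandas as pd

# Toy stand-ins for data_ai and data_ml (hypothetical rows, not the real data)
toy_ai = pd.DataFrame({"subreddit": ["artificial"] * 2, "title": ["a1", "a2"]})
toy_ml = pd.DataFrame({"subreddit": ["MachineLearning"] * 2, "title": ["m1", "m2"]})

# ignore_index=True renumbers the rows 0..n-1 directly, so no extra
# 'index' column is created and no follow-up drop is needed
toy_df = pd.concat([toy_ai, toy_ml], ignore_index=True)

print(toy_df.index.tolist())
print(toy_df.columns.tolist())
```

The result has a fresh integer index and only the original columns.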


**Let's see what a title might look like:**

In [60]:
df['title'][0]

'Could AI ethics draw on non-Western philosophies to help reframe AI ethics'

**Let's see what a selftext might look like:**

In [61]:
df['selftext'][68]

'Chatbots continue to be a topic of much discourse in media outlets and vendor communities. But one industry where chatbots hold the power of total transformation is customer service. Some of the leading brands in the world are employing chatbots to enhance their customer engagement and get more people to try their products.'
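
The previews above show that `selftext` is sometimes empty and sometimes the placeholder `[removed]`. A quick way to count those cases might look like the sketch below; it assumes a toy frame with hypothetical rows and that empty cells are loaded as missing values:

```python
import pandas as pd

# Toy frame with one missing, one removed, and one real selftext (hypothetical rows)
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3"],
    "selftext": [None, "[removed]", "Some body text"],
})

is_missing = toy["selftext"].isna()            # empty cells read as missing values
is_removed = toy["selftext"].eq("[removed]")   # placeholder left for removed posts
has_text = ~(is_missing | is_removed)          # rows with usable body text

print(is_missing.sum(), is_removed.sum(), has_text.sum())
```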

## **Train/Test Split**

In [62]:
X = df[['title']]
y = df['subreddit']

In [63]:
X.head(1)

| | title |
|---|---|
| 0 | Could AI ethics draw on non-Western philosophi... |


In [64]:
X.shape

(62598, 1)

In [65]:
y.shape

(62598,)
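
With `X` and `y` defined, the split itself could look like the sketch below. It is an illustration on toy data, not the notebook's own call: the `test_size`, `stratify`, and `random_state` values are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data standing in for df (hypothetical rows)
toy = pd.DataFrame({
    "title": [f"post {i}" for i in range(20)],
    "subreddit": ["artificial"] * 10 + ["MachineLearning"] * 10,
})
X_toy = toy[["title"]]
y_toy = toy["subreddit"]

# stratify keeps the class balance the same in the train and test sets;
# test_size and random_state are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

Stratifying matters less when the classes are already balanced, but it guarantees the test set keeps the same 50/50 mix.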

## **Baseline Score**

In [67]:
# This is the baseline
y.value_counts(normalize=True)

artificial         0.5
MachineLearning    0.5
Name: subreddit, dtype: float64

The baseline accuracy is 50%. The two classes are evenly sampled, so always predicting the same subreddit, or guessing at random, would label half of the posts correctly.

This is the score to beat.
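
The baseline reasoning can be checked with a dummy model. This is a minimal sketch on toy balanced labels (not the real data), using scikit-learn's `DummyClassifier`, which the notebook itself does not use:

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy balanced labels standing in for y (hypothetical data)
y_toy = pd.Series(["artificial"] * 5 + ["MachineLearning"] * 5, name="subreddit")
# DummyClassifier ignores the features, so a placeholder matrix is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

Any real model should score clearly above this value to be worth keeping.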