# Contents
- [Imports](#import)
- [Further Cleaning](#clean)
- [Splitting and Vectorizing](#splitvect)
- [Baseline Model](#model)

---
# Imports<a id=import></a>

In [13]:
import numpy as np
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

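# The NLTK stopword list and WordNet data must be available locally,
# e.g. via nltk.download('stopwords') and nltk.download('wordnet').
lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')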

In [2]:
cleaned=pd.read_csv(r'..\datasets\cleaned.csv')

In [130]:
# check the data we imported
display(cleaned.head())
display(cleaned.tail())

| Unnamed: 0 | text | subreddit |
|---|---|---|
| 0 | how to excel in a physics undergraduate degree... | 1 |
| 1 | why do shorter wavelength form better images?i... | 1 |
| 2 | how to solve for maximum compression in a spri... | 1 |
| 3 | practice exam questioni have this question a ... | 1 |
| 4 | lagrangian mechanics and generalized forcesso ... | 1 |

| Unnamed: 0 | text | subreddit |
|---|---|---|
| 1987 | speed of the trucka truck started from town a ... | 0 |
| 1988 | factor questionhi, i'm trying to brush up on m... | 0 |
| 1989 | how to calculate the mean ? is there better wa... | 0 |
| 1990 | can anyone tell me what will be the sales ratio? | 0 |
| 1991 | drop chanceso i'm currently playing a game tha... | 0 |


---
# Further Cleaning<a id=clean></a>
Before we create our baseline model, we perform further cleaning by removing stopwords, numeric characters and punctuation from our data.

In [4]:
#remove stopwords
cleaned['stoptext'] = cleaned['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [5]:
#remove all numeric characters
cleaned['stoptext'] = cleaned['stoptext'].str.replace(r'\d+', ' ', regex=True)

In [6]:
#remove all punctuation
cleaned['stoptext'] = cleaned['stoptext'].str.replace(r'[^\s\w]+', ' ', regex=True)

In [135]:
#check original text
cleaned.text[5]

'does pressure in a sealed container rise as it ascends in altitude?the source of this discussion is talking about inflating an inflatable stand up paddleboard at lower altitude and then driving it up to a mountain lake. these paddle boards have a stiff strong structure that holds their shape. they are designed to hold aprox. 15psi. if someone was to fill the paddleboard to 15psi say at 4,000ft altitude, then drive it to a mountain lake at say 8,000ft altitude, will the pressure in the paddleboard change from the change in altitude or is 15psi in a container, 15psi regardless of ambient pressure?'

In [136]:
#check cleaned text
cleaned.stoptext[5]

'pressure sealed container rise ascends altitude the source discussion talking inflating inflatable stand paddleboard lower altitude driving mountain lake  paddle boards stiff strong structure holds shape  designed hold aprox   psi  someone fill paddleboard  psi say    ft altitude  drive mountain lake say    ft altitude  pressure paddleboard change change altitude  psi container   psi regardless ambient pressure '
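
The three cleaning steps can be sketched on a toy string. This is a minimal sketch rather than the notebook's exact pipeline: `TOY_STOPWORDS` and `clean_text` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and digits and punctuation are stripped before stopword removal so that words glued to punctuation (as in 'altitude?the') are split apart first.

```python
import re

# Hypothetical mini stopword set; the notebook uses stopwords.words('english').
TOY_STOPWORDS = {"the", "in", "a", "as", "it", "of", "is", "this"}


def clean_text(text, stopwords=TOY_STOPWORDS):
    """Strip digits and punctuation, then drop stopwords."""
    text = re.sub(r"\d+", " ", text)       # remove numeric characters
    text = re.sub(r"[^\s\w]+", " ", text)  # remove punctuation
    # Splitting after punctuation removal separates tokens such as 'altitude?the'.
    return " ".join(word for word in text.split() if word not in stopwords)


sample = "does pressure rise as it ascends in altitude?the source is 15psi at 4,000ft"
cleaned_sample = clean_text(sample)
print(cleaned_sample)
```

With this ordering, the 'the' that follows 'altitude?' is removed along with the other stopwords.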

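We observe that the digits and punctuation have been removed as intended. A few stopwords remain, such as 'the' in 'altitude the source': stopwords were removed before punctuation, so 'altitude?the' was still a single token at that point. The stop_words='english' setting of the vectorizer used later filters these out.<br/>
Next, we lemmatize our words so that variants of the same word are counted as one term.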

In [7]:
cleaned['stoptext'] = cleaned['stoptext'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

In [138]:
cleaned.stoptext[5]

'pressure sealed container rise ascends altitude the source discussion talking inflating inflatable stand paddleboard lower altitude driving mountain lake paddle board stiff strong structure hold shape designed hold aprox psi someone fill paddleboard psi say ft altitude drive mountain lake say ft altitude pressure paddleboard change change altitude psi container psi regardless ambient pressure'

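We observe that our lemmatizer has worked: 'holds' in our original text has been lemmatized to 'hold', and 'boards' to 'board'. Verb forms such as 'ascends' and 'driving' are unchanged because the lemmatizer treats every word as a noun by default.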

---
# Splitting and Vectorizing<a id=splitvect></a>
Next, we split our data into training and test sets and apply a count vectorizer before we create our model.

In [8]:
X_train,X_test,y_train,y_test=train_test_split(cleaned.stoptext,cleaned.subreddit,random_state=42,stratify=cleaned.subreddit)

We check if our data has been stratified properly:

In [140]:
print(y_train.value_counts())
print(y_test.value_counts())

0    748
1    746
Name: subreddit, dtype: int64
0    250
1    248
Name: subreddit, dtype: int64
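
The counts above can be turned into class proportions to confirm the split. This is a small sketch that uses only the numbers printed above; the helper name `proportions` is our own.

```python
# Class counts copied from the value_counts() output above.
train_counts = {0: 748, 1: 746}
test_counts = {0: 250, 1: 248}


def proportions(counts):
    """Return the share of each class label."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_prop = proportions(train_counts)
test_prop = proportions(test_counts)

# Share of all rows that ended up in the test set.
n_train = sum(train_counts.values())
n_test = sum(test_counts.values())
test_fraction = n_test / (n_train + n_test)

print(train_prop, test_prop, test_fraction)
```

Both splits are close to 50/50, and the test set holds a quarter of the rows, which matches the default `test_size` of `train_test_split`.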


Next, we vectorize our training data.

In [9]:
cvec = CountVectorizer(stop_words='english',strip_accents='unicode')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train=pd.DataFrame(cvec.fit_transform(X_train).toarray(),columns=cvec.get_feature_names_out())

In [148]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1494 entries, 0 to 1493
Columns: 7236 entries, 4ρ4φ to τy
dtypes: int64(7236)
memory usage: 82.5 MB
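
The reported memory usage follows directly from the dense shape. The sketch below is a back-of-the-envelope check, assuming 8 bytes per int64 value and reading "MB" as 2**20 bytes; the function name `dense_int64_mib` is our own.

```python
def dense_int64_mib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense int64 table in units of 2**20 bytes."""
    return n_rows * n_cols * bytes_per_value / 2**20


# Shape taken from the info() output above: 1494 rows, 7236 columns.
train_mib = dense_int64_mib(1494, 7236)
print(round(train_mib, 1))  # 82.5
```

Most of these cells are zeros, which is why the vectorizer returns a sparse matrix by default; converting to a dense DataFrame is convenient for inspection but costs memory.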


Our training corpus has been vectorized into 7236 features.<br/>
Let us analyze which words appear the most.

In [123]:
# Which words appear the most?
word_counts = X_train.sum(axis=0)
print(word_counts.sort_values(ascending = False).head(20))
print(word_counts.sort_values(ascending = False).tail(20))

amp         649
question    508
time        410
know        396
help        396
like        334
problem     321
answer      313
number      305
physic      294
force       282
need        252
equation    248
energy      246
point       227
way         209
say         204
ve          196
work        196
speed       192
dtype: int64
maxed               1
matteri             1
matterhi            1
matlab              1
market              1
maththere           1
mathing             1
mathgiven           1
mathemathics        1
mathemagicians      1
mathconsumer        1
mathcal             1
mathbb              1
matched             1
masteringphysics    1
martingale          1
martial             1
marpocky            1
markup              1
4ρ4φ                1
dtype: int64


As expected, the most frequent words are generic terms such as question, time, know and help.<br/>
The term amp stands out. From manual inspection of our data, it is most likely a remnant of HTML-escaped ampersands left over from our scraping process.<br/>
Let us analyze the most recurring terms in each of our subreddits.

In [124]:
print('Most common words in AskPhysics:')
print(X_train.loc[:,(pd.DataFrame(y_train)==1).subreddit.tolist()].sum(axis=0).sort_values(ascending=False).head(20))
print('Most common words in askmath:')
print(X_train.loc[:,(pd.DataFrame(y_train)==0).subreddit.tolist()].sum(axis=0).sort_values(ascending=False).head(20))

Most common words in AskPhysics:
answer          313
correct         121
change           90
acceleration     88
bit              86
charge           75
book             75
average          73
class            71
case             68
appreciated      64
assume           62
course           59
chance           57
black            49
calculus         46
approach         46
certain          46
card             45
basic            42
dtype: int64
Most common words in askmath:
amp            649
ball           174
calculate      115
angle           95
come            87
able            83
constant        82
algebra         78
air             77
area            69
advance         64
actually        62
cm              54
based           49
confused        49
calculation     48
best            46
center          40
axis            39
concept         38
dtype: int64


By looking at the most recurring words per subreddit, we hoped to surface more topic-specific terms.<br/>
The two lists need to be read with care, however. Every word in them starts with a, b or c, and the counts for answer (313) and amp (649) equal the overall totals from the previous cell. This suggests that the boolean mask built from y_train was applied to the columns of X_train (the .loc[:, mask] form selects columns) rather than to its rows, so each list shows overall counts for part of the vocabulary instead of counts within one subreddit.<br/>
Words such as acceleration and charge (kinematics, electromagnetism) or angle and algebra (trigonometry, algebra) are plausibly topic-specific, but these lists cannot tell us which subreddit they dominate. Selecting rows by label would be needed for a true per-subreddit comparison.
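
A per-subreddit word count has to filter documents (rows) by label before summing. The sketch below shows the idea with the standard library only; the toy documents, labels and the helper `top_words_by_class` are hypothetical and not taken from the dataset. In pandas terms this would correspond to a row selection such as `X_train[(y_train == 1).to_numpy()]`, which is an assumption about how the cell could be written, not the notebook's code.

```python
from collections import Counter

# Hypothetical toy corpus: label 1 stands for AskPhysics, 0 for askmath.
docs = [
    "force acceleration charge",
    "angle algebra area",
    "force energy",
    "angle ball",
]
labels = [1, 0, 1, 0]


def top_words_by_class(docs, labels, target, n=3):
    """Count words only in documents whose label equals `target`."""
    counts = Counter()
    for doc, label in zip(docs, labels):
        if label == target:
            counts.update(doc.split())
    return counts.most_common(n)


physics_top = top_words_by_class(docs, labels, target=1, n=1)
math_top = top_words_by_class(docs, labels, target=0, n=1)
print(physics_top, math_top)
```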

---
# Baseline Model<a id=model></a>
Finally, we are ready to create our baseline model.<br/>
We start by transforming X_test with the vectorizer fitted on the training data and storing the result as a dataframe.

In [10]:
# transform only: the vocabulary learned from the training data is reused
X_test=pd.DataFrame(cvec.transform(X_test).toarray(),columns=cvec.get_feature_names_out())

In [149]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Columns: 7236 entries, 4ρ4φ to τy
dtypes: int64(7236)
memory usage: 27.5 MB
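
The test set has the same 7236 columns as the training set because the vocabulary is learned once, on the training data, and then reused. The sketch below illustrates this with a toy stand-in for a count vectorizer; `fit_vocabulary` and `transform` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
def fit_vocabulary(docs):
    """Learn a sorted vocabulary from the training documents only."""
    return sorted({word for doc in docs for word in doc.split()})


def transform(docs, vocab):
    """Count occurrences of each vocabulary word; unseen words are ignored."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows


train_docs = ["force energy force", "angle area"]
test_docs = ["force angle torque"]  # 'torque' never appears in training

vocab = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)
print(vocab)
print(test_rows)
```

Words that appear only in the test set, such as 'torque' here, get no column, so the model never sees features it was not trained on.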


For our baseline model, we will utilize the reliable Logistic Regression classifier with default settings.

In [11]:
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
logreg.score(X_test,y_test)



0.893574297188755

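We obtain a mean accuracy of 89.4% from our baseline model. Since the test set is almost perfectly balanced (250 vs. 248 posts), always predicting the majority class would score only about 50.2%, so this is a strong start.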

In [14]:
pred=logreg.predict(X_test)
f1_score(y_test,pred)

0.8893528183716075

We also calculate the F1 score, the harmonic mean of precision and recall for the positive class (AskPhysics, label 1), which comes to 0.889.<br/>
We will use this as our main metric for scoring all future models.<br/>
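
For reference, the F1 score can be computed by hand from the counts of true positives, false positives and false negatives. The counts below are hypothetical and chosen only to illustrate the formula; they are not the confusion matrix of our model.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
toy_f1 = f1_from_counts(tp=8, fp=2, fn=1)
print(toy_f1)
```

Because it combines precision and recall, F1 penalizes a model that does well on only one of the two, which accuracy alone would not reveal.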

In the next section, we will investigate how we may improve our model and create an optimal model to use as our final classifier.