# Data Cleaning and EDA

## Process of cleaning 

*Code written for cleaning in [cleaning.py](projects/project-3/code/cleaning.py).*

* We'll pass a list of the CSV files pulled from Reddit into a function that cleans them and concatenates them into one dataframe. The function will:
    * Remove any rows where the "selftext" column is null, since a post with no body text isn't useful for our purposes.
    * Rename the "Unnamed: 0" column to "post_id".
    * Remove the "comments", "utc", and "title" columns. I pulled these in just in case, but for now I am not planning to use them.
    * Add an "is_osr" column that assigns 0/1 values for the subreddit: 0 for the rpg subreddit, 1 for the osr subreddit.
    * Return the concatenated dataframe.
* Once the completed dataframe is created, we'll use "post_id" to deduplicate the data, then get a final count of posts per subreddit to see if we need to pull in more data.
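The steps above can be sketched roughly as follows. This is not the code in cleaning.py: `clean_data_sketch` is a hypothetical name, the sketch assumes the "subreddit" column holds the subreddit name, and the two in-memory CSVs are made-up stand-ins for the scraped files.

```python
import io

import pandas as pd


def clean_data_sketch(paths):
    """Hypothetical stand-in for cleaning.clean_data (the real code is in cleaning.py)."""
    frames = []
    for path in paths:
        df = pd.read_csv(path)
        # Drop posts with no body text
        df = df.dropna(subset=["selftext"])
        # The unnamed first column of the CSV holds the post id
        df = df.rename(columns={"Unnamed: 0": "post_id"})
        # Drop columns we don't plan to use; skip any that are absent
        df = df.drop(columns=["comments", "utc", "title"], errors="ignore")
        # 1 for the osr subreddit, 0 otherwise (assumes a "subreddit" column)
        df["is_osr"] = (df["subreddit"].str.lower() == "osr").astype(int)
        frames.append(df)
    # Stack the per-file frames; each keeps its own index, so labels can repeat
    return pd.concat(frames)


# Made-up CSV contents; read_csv accepts file-like objects as well as paths
csv_a = io.StringIO(",selftext,subreddit,title\na1,Hello there,osr,t1\na2,,osr,t2\n")
csv_b = io.StringIO(",selftext,subreddit,title\nb1,Some post,rpg,t3\n")
demo = clean_data_sketch([csv_a, csv_b])
print(demo)
```

Because the frames are concatenated without resetting the index, index labels can repeat across files in this sketch.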

In [8]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
import sys
import os
sys.path.append(os.path.abspath("./"))
import importlib
#importlib.reload(cleaning)  # Force reload the module


import cleaning

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



In [15]:
# Collect the raw scraped CSVs, skipping the derived files this notebook writes
derived = {"cleaned_df.csv", "cleaned_with_sentiment_df.csv"}
files = [f"../data/{file}" for file in os.listdir("../data")
         if file.endswith(".csv") and file not in derived]

files

['../data/osr_244-489.csv',
 '../data/nsr_1000-1150.csv',
 '../data/osr_1931-2163.csv',
 '../data/osr_1683-1931.csv',
 '../data/nsr_1150-1394.csv',
 '../data/osr_0-243.csv',
 '../data/osr_490-1190.csv',
 '../data/osr_1190-1433.csv',
 '../data/nsr_1394-1628.csv',
 '../data/nsr_0-999.csv',
 '../data/nsr_1628-1862.csv',
 '../data/osr_1433-1683.csv']

In [16]:
cleaned = cleaning.clean_data(files)

In [17]:
cleaned = cleaned.drop_duplicates(subset="post_id")

In [18]:
cleaned["is_osr"].value_counts()


is_osr
1    1145
0    1102
Name: count, dtype: int64
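As a small illustration of the deduplication and count steps, here is a sketch on a toy dataframe. The post ids are made up, and the repeated id stands in for a post that shows up in two overlapping scrapes.

```python
import pandas as pd

# Toy stand-in for the concatenated frame; "b2" appears twice (made-up ids)
toy = pd.DataFrame({
    "post_id": ["a1", "b2", "b2", "c3"],
    "is_osr": [1, 0, 0, 1],
})

# Keep the first row for each post_id and drop later repeats
deduped = toy.drop_duplicates(subset="post_id")

# Posts per class after deduplication
counts = deduped["is_osr"].value_counts()
print(counts)
```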

### Now that we have at least 1000 posts from each of our subreddits, I'm going to save our cleaned dataframe for safe keeping

In [19]:
#cleaned.to_csv("../data/cleaned_df.csv")

Next, I want to run a sentiment analysis on the posts from these subreddits.

In [21]:
sa = SentimentIntensityAnalyzer()

In [27]:
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html

sentiment_data = [sa.polarity_scores(text) for text in cleaned["selftext"]]
sentiment_df = pd.DataFrame.from_dict(sentiment_data)
sentiment_df.shape


(2247, 4)

In [33]:
cleaned.shape
cleaned.head()
cleaned.tail()

# indexes are not unique, so I will reset them to make concatenation possible

cleaned = cleaned.reset_index()

In [34]:
cleaned.tail()

|      | index | post_id | selftext | subreddit | is_osr |
|------|-------|---------|----------|-----------|--------|
| 2242 | 243 | 1j0hj6f | Pretty much the title. I am curious in terms o... | osr | 1 |
| 2243 | 244 | 1j0gblw | I have a database of creature stats for PF2 an... | osr | 1 |
| 2244 | 246 | 1j0fy0e | Not an OSR player but running a hex crawl for ... | osr | 1 |
| 2245 | 247 | 1j0e1wc | I'm running an old school Adventure I won't na... | osr | 1 |
| 2246 | 248 | 1j0dwly | My wife has expressed interest in giving ttrpg... | osr | 1 |


In [37]:
cleaned = pd.concat([cleaned, sentiment_df],axis=1)
cleaned.head()

|   | index | post_id | selftext | subreddit | is_osr | neg | neu | pos | compound |
|---|-------|---------|----------|-----------|--------|-----|-----|-----|----------|
| 0 | 1 | 1irnln3 | HeroQuest is the perfect entry into OSR DND. ... | osr | 1 | 0.118 | 0.708 | 0.174 | 0.5313 |
| 1 | 2 | 1f3scds | To be clear, it was a lot of work before the g... | osr | 1 | 0.094 | 0.76 | 0.147 | 0.9769 |
| 2 | 3 | 1di0qn6 | So, I know there was a thread discussing peopl... | osr | 1 | 0.066 | 0.859 | 0.075 | 0.3694 |
| 3 | 4 | 1g5ga0h | Really loving the booklet layout. Open up char... | osr | 1 | 0.0 | 0.758 | 0.242 | 0.9237 |
| 4 | 5 | 1grfhij | In this video I discuss why I consider Castles... | osr | 1 | 0.0 | 0.701 | 0.299 | 0.5719 |
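To see why the index reset matters before a column-wise concat, here is a minimal sketch on toy data. The post ids and score values are made up; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Stacking per-file frames leaves a repeating index: [0, 1, 0]
posts = pd.concat([
    pd.DataFrame({"post_id": ["a1", "a2"]}),
    pd.DataFrame({"post_id": ["b1"]}),
])

# One dict of scores per post (made-up values), turned into a 4-column frame
score_dicts = [
    {"neg": 0.0, "neu": 0.8, "pos": 0.2, "compound": 0.5},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.4},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
scores = pd.DataFrame(score_dicts)  # fresh 0..n-1 index

# reset_index() gives posts a fresh 0..n-1 index too, keeping the old
# labels in a new "index" column, so the rows line up one to one
posts = posts.reset_index()
combined = pd.concat([posts, scores], axis=1)
print(combined)
```

Calling `reset_index(drop=True)` instead would discard the old labels rather than keep them as an extra "index" column.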


Now that we have a dataframe with our Reddit data, let's explore it. I will also do some preprocessing on the dataframe here so we can learn more about our data.

In [38]:
cleaned.shape
cleaned["is_osr"].value_counts(normalize=True) 


is_osr
1    0.509568
0    0.490432
Name: proportion, dtype: float64
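The larger of these two proportions is the accuracy a model would get by always predicting the majority class. A minimal sketch, rebuilding the labels from the counts printed earlier (1145 osr posts, 1102 others):

```python
import pandas as pd

# Rebuild the label column from the class counts shown earlier
labels = pd.Series([1] * 1145 + [0] * 1102, name="is_osr")

# Share of the majority class = accuracy of always guessing that class
baseline = labels.value_counts(normalize=True).max()
print(round(baseline, 4))
```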

Our baseline accuracy is about 51%: a model that always predicts the osr subreddit would be right 51.0% of the time. First, let's normalize our text by lowercasing, removing punctuation, tokenizing, removing stop words, and stemming.


In [39]:
translator = str.maketrans('', '', string.punctuation)

#strip punctuation from the "selftext" column, then lowercase all text
cleaned["selftext"] = cleaned["selftext"].map(lambda x: x.translate(translator).lower())
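A quick sketch of what this translation table does to a single string. The sample sentence is made up. Note that hyphenated words get fused together, since the hyphen is deleted rather than replaced with a space.

```python
import string

# Same translation table as above: delete every ASCII punctuation character
translator = str.maketrans("", "", string.punctuation)

sample = "Seven-year-old's first game: HeroQuest!"
normalized = sample.translate(translator).lower()
print(normalized)  # sevenyearolds first game heroquest
```

`string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would pass through unchanged.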

In [40]:
#tokenize text in "selftext" column
cleaned["selftext"] = [word_tokenize(post) for post in cleaned["selftext"]]

In [41]:
cleaned.head()

|   | index | post_id | selftext | subreddit | is_osr | neg | neu | pos | compound |
|---|-------|---------|----------|-----------|--------|-----|-----|-----|----------|
| 0 | 1 | 1irnln3 | [heroquest, is, the, perfect, entry, into, osr... | osr | 1 | 0.118 | 0.708 | 0.174 | 0.5313 |
| 1 | 2 | 1f3scds | [to, be, clear, it, was, a, lot, of, work, bef... | osr | 1 | 0.094 | 0.76 | 0.147 | 0.9769 |
| 2 | 3 | 1di0qn6 | [so, i, know, there, was, a, thread, discussin... | osr | 1 | 0.066 | 0.859 | 0.075 | 0.3694 |
| 3 | 4 | 1g5ga0h | [really, loving, the, booklet, layout, open, u... | osr | 1 | 0.0 | 0.758 | 0.242 | 0.9237 |
| 4 | 5 | 1grfhij | [in, this, video, i, discuss, why, i, consider... | osr | 1 | 0.0 | 0.701 | 0.299 | 0.5719 |


In [42]:
#remove stop words (build the set once; calling stopwords.words() for every token is very slow)
stop_words = set(stopwords.words("english"))
cleaned["selftext"] = cleaned["selftext"].map(lambda x: [token for token in x if token not in stop_words])
#column counting the number of tokens left after stop words are removed
cleaned["num_of_token_wo_stop"] = cleaned["selftext"].map(len)
cleaned.head()

|   | index | post_id | selftext | subreddit | is_osr | neg | neu | pos | compound | num_of_token_wo_stop |
|---|-------|---------|----------|-----------|--------|-----|-----|-----|----------|----------------------|
| 0 | 1 | 1irnln3 | [heroquest, perfect, entry, osr, dnd, sevenyea... | osr | 1 | 0.118 | 0.708 | 0.174 | 0.5313 | 20 |
| 1 | 2 | 1f3scds | [clear, lot, work, game, started, run, jacob, ... | osr | 1 | 0.094 | 0.76 | 0.147 | 0.9769 | 214 |
| 2 | 3 | 1di0qn6 | [know, thread, discussing, peoples, disappoint... | osr | 1 | 0.066 | 0.859 | 0.075 | 0.3694 | 82 |
| 3 | 4 | 1g5ga0h | [really, loving, booklet, layout, open, charac... | osr | 1 | 0.0 | 0.758 | 0.242 | 0.9237 | 35 |
| 4 | 5 | 1grfhij | [video, discuss, consider, castles, crusades, ... | osr | 1 | 0.0 | 0.701 | 0.299 | 0.5719 | 8 |


### I also decided that the number of tokens remaining after removing stop words would be interesting to know

#### People on these subreddits can be verbose, so we'll use stemming instead of lemmatizing. This will be faster and won't require us to tag every token with the part of speech it belongs to, which lemmatizing needs in order to pick the right base form.

In [43]:
ps = PorterStemmer()
cleaned["selftext"] = cleaned["selftext"].map(lambda x: [ps.stem(token) for token in x ])
cleaned.head()

|   | index | post_id | selftext | subreddit | is_osr | neg | neu | pos | compound | num_of_token_wo_stop |
|---|-------|---------|----------|-----------|--------|-----|-----|-----|----------|----------------------|
| 0 | 1 | 1irnln3 | [heroquest, perfect, entri, osr, dnd, sevenyea... | osr | 1 | 0.118 | 0.708 | 0.174 | 0.5313 | 20 |
| 1 | 2 | 1f3scds | [clear, lot, work, game, start, run, jacob, fl... | osr | 1 | 0.094 | 0.76 | 0.147 | 0.9769 | 214 |
| 2 | 3 | 1di0qn6 | [know, thread, discuss, peopl, disappoint, sys... | osr | 1 | 0.066 | 0.859 | 0.075 | 0.3694 | 82 |
| 3 | 4 | 1g5ga0h | [realli, love, booklet, layout, open, charact,... | osr | 1 | 0.0 | 0.758 | 0.242 | 0.9237 | 35 |
| 4 | 5 | 1grfhij | [video, discuss, consid, castl, crusad, true, ... | osr | 1 | 0.0 | 0.701 | 0.299 | 0.5719 | 8 |
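To make the effect of stemming concrete, here is a small sketch that runs the Porter stemmer on a few tokens taken from the posts shown in the tables. Stems are truncated forms, not dictionary words, which is why "entry" becomes "entri".

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Tokens as they appear before stemming
tokens = ["entry", "started", "discussing", "really", "loving", "castles"]

# Porter stemming strips suffixes by rule; it needs no part-of-speech tags
stems = [ps.stem(token) for token in tokens]
print(stems)
```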


### **Let's save this dataframe for safe keeping**

In [44]:
cleaned.to_csv("../data/cleaned_with_sentiment_df.csv")






### Let's look at CountVectorizer
#### We preprocessed the data by hand above, but CountVectorizer can do much of this for us as well

In [None]:
# eda using CountVectorizing in nlp 2 lesson

Since our classes are roughly balanced, we won't need to rebalance them (for example, by resampling) for now.