## 0.3 Bot or Human?: Exploring Data from Subreddits


### Contents:
- [Import Libraries](#Import-Libraries)
- [Read in data scraped from subreddits](#Read-in-data-scraped-from-subreddits)
- [Data Processing](#Data-Processing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

## Import Libraries

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer

## Read in data scraped from subreddits

In [2]:
df_bot = pd.read_csv('../Data/bot.csv')

In [3]:
df_human = pd.read_csv('../Data/human.csv')

## Data Processing

In [4]:
def clean_posts(data):
    '''
    1. Converts post titles to lowercase
    2. Replaces curly quotes, which are not in string.punctuation, with ':'
       so that the next step removes them
    3. Removes punctuation and stores the result in 'post_title_clean'
    Note: modifies `data` in place ('post_title' is overwritten) and returns it.
    '''
    titles = data['post_title'].str.lower()
    for quote in '’‘“”':
        titles = titles.str.replace(quote, ':', regex=False)
    data['post_title'] = titles
    data['post_title_clean'] = titles.map(lambda x: ''.join(k for k in x if k not in string.punctuation))
    return data
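
To make the cleaning steps concrete, here is a minimal sketch that applies the same three steps to two made-up titles. The sample titles and the names `sample` and `lowered` are hypothetical and do not come from the scraped data; the set of quote characters handled here includes the closing double quote `”`.

```python
import string

import pandas as pd

# Hypothetical sample titles, not taken from the scraped subreddit data.
sample = pd.DataFrame({'post_title': ["I’m Here, Don't Panic!", 'TIL: “Water” is wet']})

# Step 1: lowercase the titles.
lowered = sample['post_title'].str.lower()

# Step 2: map curly quotes (absent from string.punctuation) to ':' so step 3 removes them.
for quote in '’‘“”':
    lowered = lowered.str.replace(quote, ':', regex=False)

# Step 3: drop every character listed in string.punctuation.
sample['post_title_clean'] = lowered.map(
    lambda title: ''.join(ch for ch in title if ch not in string.punctuation)
)

print(sample['post_title_clean'].tolist())
# ['im here dont panic', 'til water is wet']
```

Because apostrophes are removed rather than replaced with a space, contractions collapse into single tokens, which is why words such as `im` and `dont` show up in the word counts.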

In [5]:
for df in [df_bot, df_human]:
    clean_posts(df)  # adds 'post_title_clean' to each DataFrame in place

In [6]:
X_bot = df_bot['post_title_clean']
X_human = df_human['post_title_clean']

In [7]:
def common_words(corpus, num_words):
    '''
    1. Converts text data into a matrix of token counts
       (English stop words removed, vocabulary capped at the 500 most frequent terms)
    2. Sums the counts for each word and sorts them in descending order
    3. Returns the num_words words with the highest counts
    '''
    cvec = CountVectorizer(max_features=500, stop_words='english')
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    df = pd.DataFrame(cvec.fit_transform(corpus).toarray(), columns=cvec.get_feature_names_out())
    word_counts = df.sum(axis=0)
    return word_counts.sort_values(ascending=False).head(num_words)
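
For larger corpora, the same counts can be obtained without building a dense DataFrame first. The following is only a sketch of one possible variant: `common_words_sparse`, its `max_features` parameter, and the toy corpus are assumptions introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def common_words_sparse(corpus, num_words, max_features=500):
    """Hypothetical variant of common_words that sums the sparse matrix directly."""
    cvec = CountVectorizer(max_features=max_features, stop_words='english')
    counts = cvec.fit_transform(corpus)  # sparse document-term matrix
    # Column sums give the total count per term; flatten to a 1-D array.
    totals = np.asarray(counts.sum(axis=0)).ravel()
    # Newer scikit-learn exposes get_feature_names_out(); fall back for older releases.
    if hasattr(cvec, 'get_feature_names_out'):
        vocab = cvec.get_feature_names_out()
    else:
        vocab = cvec.get_feature_names()
    return pd.Series(totals, index=vocab).sort_values(ascending=False).head(num_words)


# Toy corpus (made up): 'pizza' appears 4 times, 'robot' twice, 'kitten' once.
toy_corpus = ['robot pizza pizza', 'pizza robot', 'pizza kitten']
top_two = common_words_sparse(toy_corpus, 2)
print(top_two)
```

Summing the sparse matrix avoids allocating a documents-by-vocabulary dense array, which matters little for a few hundred post titles but can matter for much larger scrapes.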

In [8]:
common_words(X_bot, 20)

like      44
just      38
im        33
time      30
dont      28
man       27
know      25
people    24
vs        24
got       23
years     21
day       21
did       20
good      20
need      19
help      19
new       18
think     17
friend    17
mrw       15
dtype: int64

In [9]:
common_words(X_human, 20)

people      100
probably     63
just         58
time         52
like         38
life         37
way          36
know         35
dont         31
years        29
youre        29
actually     27
world        27
make         26
day          26
water        23
person       23
say          20
think        19
humans       19
dtype: int64
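
One way to read the two lists side by side is to split them into words that appear in both top-20 lists and words that appear in only one. The sketch below copies the words by hand from the two outputs above; the variable names are illustrative, and the split is a descriptive aid, not a statistical test.

```python
# Top-20 words copied from the common_words outputs for each corpus.
bot_top = ['like', 'just', 'im', 'time', 'dont', 'man', 'know', 'people', 'vs', 'got',
           'years', 'day', 'did', 'good', 'need', 'help', 'new', 'think', 'friend', 'mrw']
human_top = ['people', 'probably', 'just', 'time', 'like', 'life', 'way', 'know', 'dont',
             'years', 'youre', 'actually', 'world', 'make', 'day', 'water', 'person',
             'say', 'think', 'humans']

human_set = set(human_top)
bot_set = set(bot_top)

# Preserve each list's original ranking order while splitting it.
shared = [word for word in bot_top if word in human_set]
bot_only = [word for word in bot_top if word not in human_set]
human_only = [word for word in human_top if word not in bot_set]

print('shared:    ', shared)
print('bot only:  ', bot_only)
print('human only:', human_only)
```

Inside the notebook, the two lists could instead be taken from `common_words(X_bot, 20).index` and `common_words(X_human, 20).index`. Nine of the twenty words are shared (`like`, `just`, `time`, `dont`, `know`, `people`, `years`, `day`, `think`), so the remaining eleven in each list are the ones that differ between the two sources at this cutoff.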