## 0.2 Bot or Human?: Exploring Data from Subreddits


### Contents:
- [Import Libraries](#Import-Libraries)
- [Read in data scrapped from subreddits](#Read-in-data-scrapped-from-subreddits)
- [Data Processing](#Data-Processing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

## Import Libraries

In [1]:
import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer

## Read in data scrapped from subreddits

In [2]:
df_bot = pd.read_csv('../Data/bot.csv')

In [3]:
df_human = pd.read_csv('../Data/human.csv')

## Data Processing

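To clean the data, post titles are converted to lowercase and punctuation is removed. The cleaned text is stored in a new `post_title_clean` column.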


In [4]:
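def clean_posts(data):
    '''
    1. Converts post titles to lowercase
    2. Replaces curly quotes, which are not in string.punctuation, with ':' so that step 3 removes them
    3. Removes punctuation and stores the result in 'post_title_clean'
    '''
    data['post_title'] = data['post_title'].str.lower()
    # The closing double quote (”) is handled as well as the other three curly quotes
    for quote in ['’', '‘', '“', '”']:
        data['post_title'] = data['post_title'].str.replace(quote, ':', regex=False)
    data['post_title_clean'] = data['post_title'].map(lambda x: ''.join(k for k in x if k not in string.punctuation))
    return data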
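
The curly-quote step is needed because `string.punctuation` contains only ASCII punctuation, so typographic quotes pass straight through a filter built on it. The short check below illustrates this. The helper `strip_ascii_punctuation` and the sample title are hypothetical, introduced only for this illustration, and are not part of the notebook.

```python
import string

# Curly quotes that commonly appear in scraped text
curly_quotes = ['’', '‘', '“', '”']

# None of them is listed in string.punctuation, which is ASCII-only
escaped = [q for q in curly_quotes if q not in string.punctuation]

def strip_ascii_punctuation(text):
    """Drop every character that is listed in string.punctuation."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

sample = 'i don’t know, really!'  # hypothetical post title

# Without the replacement step the curly apostrophe survives
naive = strip_ascii_punctuation(sample)

# Mapping it to ':' first lets the ASCII filter remove it
fixed = strip_ascii_punctuation(sample.replace('’', ':'))

print(escaped)
print(naive)
print(fixed)
```

Here `naive` still contains the curly apostrophe, while `fixed` reduces "don’t" to "dont", which matches the tokens seen in the word counts later in this notebook.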

In [5]:
for i in [df_bot,df_human]:
    clean_posts(i)

In [6]:
X_bot = df_bot['post_title_clean']
X_human = df_human['post_title_clean']

## Exploratory Data Analysis

I run the code below to find the 20 most common words in each subreddit. Words that rank highly in both are not specific to either subreddit, so I add them to the stop words to be filtered out during modelling.

In [7]:
def common_words(corpus, num_words):
    '''
    1. Converts text data into matrix of token counts
    2. Sums and sorts words in data, starting from highest count
    3. Returns the num_words words with the highest counts
    '''
    cvec = CountVectorizer(max_features=500, stop_words='english')
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    df = pd.DataFrame(cvec.fit_transform(corpus).toarray(), columns=cvec.get_feature_names_out())
    word_counts = df.sum(axis=0)
    return word_counts.sort_values(ascending=False).head(num_words)
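
To make the three docstring steps concrete, here is a plain-Python analogue that uses `collections.Counter` instead of `CountVectorizer`. It is a sketch, not the notebook's method: the function name `common_words_sketch`, the toy corpus, and the one-word stop list are all hypothetical, and the `max_features=500` cap is not reproduced. The regular expression mirrors `CountVectorizer`'s default token pattern, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

def common_words_sketch(corpus, num_words, stop_words=frozenset()):
    """Count tokens across all documents and return the most frequent ones."""
    # Tokens of two or more word characters, as in CountVectorizer's default
    token_pattern = re.compile(r"\b\w\w+\b")
    counts = Counter()
    for doc in corpus:
        tokens = token_pattern.findall(doc.lower())
        # Skip stop words, then add the remaining tokens to the running totals
        counts.update(t for t in tokens if t not in stop_words)
    # most_common sorts by count, highest first
    return counts.most_common(num_words)

# Hypothetical toy corpus and stop list
toy_corpus = ['the cats like naps', 'the dogs like walks', 'cats like the sun']
top = common_words_sketch(toy_corpus, 2, stop_words={'the'})
print(top)
```
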

In [8]:
common_words(X_bot, 20)

like      46
just      37
im        35
dont      31
time      28
vs        25
got       24
know      24
people    24
man       24
new       22
did       20
day       20
years     19
good      19
help      18
need      18
life      16
work      16
want      16
dtype: int64

In [9]:
common_words(X_human, 20)

people      112
probably     72
like         63
just         50
time         46
world        42
life         41
dont         37
really       31
know         31
years        29
youre        28
person       27
make         24
good         24
think        24
actually     23
humans       23
earth        22
going        22
dtype: int64
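
One way to identify the words that are not specific to either subreddit is to intersect the two lists above. The sketch below copies the top-20 words from both outputs and uses set operations; the variable names are illustrative assumptions, and the notebook does not show which words were ultimately added to the stop words.

```python
# Top-20 words copied from the two outputs above
bot_top = ['like', 'just', 'im', 'dont', 'time', 'vs', 'got', 'know', 'people', 'man',
           'new', 'did', 'day', 'years', 'good', 'help', 'need', 'life', 'work', 'want']
human_top = ['people', 'probably', 'like', 'just', 'time', 'world', 'life', 'dont',
             'really', 'know', 'years', 'youre', 'person', 'make', 'good', 'think',
             'actually', 'humans', 'earth', 'going']

# Words that appear in both lists carry little signal for telling the classes apart
shared_words = sorted(set(bot_top) & set(human_top))

# Words that appear in only one list are candidates for distinguishing features
bot_only = sorted(set(bot_top) - set(human_top))
human_only = sorted(set(human_top) - set(bot_top))

print(shared_words)
```

A list such as `shared_words` could then be combined with scikit-learn's built-in English stop words and passed to `CountVectorizer` through its `stop_words` parameter, which accepts a list of strings.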