# Scraping Analysis

## EDA

In [1]:
# Necessary libraries:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from bs4 import BeautifulSoup        
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
import regex as re
from nltk.stem.porter import PorterStemmer

In [2]:
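# Install the third-party `regex` package (imported above as `re`);
# run this cell before the imports if the package is missing.
!pip install regex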



With the necessary packages imported, we load the CSV files for the two series that I scraped in the previous notebook.

In [3]:
blackmirror = pd.read_csv('./blackmirror.csv')
westworld = pd.read_csv('./westworld.csv')

In [4]:
blackmirror.head()

,Unnamed: 0,created_utc,id,is_video,num_comments,score,selftext,spoiler,subreddit,title
0,0,1545417655,a8czdm,False,0,1,,False,blackmirror,MIT was able to reconstruct sound from the vib...
1,1,1545415295,a8cklh,False,1,1,,False,blackmirror,"'Black Mirror: Bandersnatch' Synopsis, Runtime..."
2,2,1545413557,a8c9tm,False,0,1,,True,blackmirror,What is Black Mirror: Bandersnatch?
3,3,1545410154,a8bozn,False,4,1,not sure if i need to mark this as possible sp...,True,blackmirror,White Christmas Ending
4,4,1545404674,a8at5v,False,1,1,,False,blackmirror,This is a playlist with songs from some of the...


In [5]:
westworld.head()

,Unnamed: 0,created_utc,id,is_video,num_comments,score,selftext,spoiler,subreddit,title
0,0,1545416118,a8cprr,False,2,1,,False,westworld,"Is this....Bernard? Nah, just my friend's mom'..."
1,1,1545377064,a87law,False,1,1,,False,westworld,Is there cheaper ways to watch than Hulu?
2,2,1545371104,a86vwk,False,3,1,,False,westworld,Anyone else think it’s funny that Tessa Thomps...
3,3,1545370046,a86r6o,False,0,1,,False,westworld,Black Mirror [Online Game Code] is 67% OFF
4,4,1545364029,a860qt,False,2,1,,False,westworld,Clicked on Trending and thought that they rele...


Each dataframe has an unnecessary 'Unnamed: 0' column that can be deleted. Then the NAs need to be dealt with. I dropped every row with an empty selftext cell, since more than enough rows remain afterwards for my models to run.

In [6]:
blackmirror.drop('Unnamed: 0', inplace = True, axis = 1)
westworld.drop('Unnamed: 0', inplace = True, axis = 1)

In [7]:
blackmirror.isnull().sum()

created_utc        0
id                 0
is_video           0
num_comments       0
score              0
selftext        4321
spoiler            0
subreddit          0
title              0
dtype: int64

In [8]:
westworld.isnull().sum()

created_utc        0
id                 0
is_video           0
num_comments       0
score              0
selftext        4110
spoiler            0
subreddit          0
title              0
dtype: int64

In [9]:
blackmirror = blackmirror[blackmirror.selftext.notnull()]

In [10]:
blackmirror.shape[0]

5679

In [11]:
westworld = westworld[westworld.selftext.notnull()]

In [12]:
westworld.shape[0]

5890
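
The null filter above can also be written with `dropna`. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the scraped files), showing that `dropna(subset=['selftext'])` keeps the same rows as the `notnull()` mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for one of the scraped frames
toy = pd.DataFrame({
    'title': ['post a', 'post b', 'post c'],
    'selftext': ['some text', np.nan, 'more text'],
})

# Boolean-mask form used in this notebook
masked = toy[toy.selftext.notnull()]

# Equivalent form: drop the rows where selftext is missing
dropped = toy.dropna(subset=['selftext'])

print(masked.equals(dropped))
print(dropped.shape[0])
```

Either form leaves the original row labels in place, so the index has gaps until it is reset.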

In [13]:
series = pd.concat([blackmirror, westworld], axis = 0)

In [14]:
series.head()

,created_utc,id,is_video,num_comments,score,selftext,spoiler,subreddit,title
3,1545410154,a8bozn,False,4,1,not sure if i need to mark this as possible sp...,True,blackmirror,White Christmas Ending
6,1545373691,a877em,False,4,1,"Correct me if I'm wrong, but Bandersnatch woul...",True,blackmirror,How could Bandersnatch's unique choice-based f...
8,1545369058,a86mla,False,5,1,The purpose of the CP reveal is not there to m...,False,blackmirror,A rant about peoples' reactions to SUAD
12,1545362860,a85vmm,False,5,1,Mine was The Entire History of You. The techno...,False,blackmirror,Which episode got you hooked on Black Mirror?
13,1545354878,a84uyn,False,9,1,For me it wasn’t surprising because I already ...,True,blackmirror,Was Shut Up and Dance not surprising to anyone...


Then some columns need some changes:
- is_video is True for only 12 of the 11,569 posts, so the column is dropped.
- spoiler is converted to 1s and 0s so that it holds only numeric values.
- subreddit is recoded so that blackmirror posts are assigned 1 and westworld posts are assigned 0.
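
A minimal sketch of the two recodings on a made-up frame (the rows are hypothetical; only the column names match the notebook). The comparison followed by `astype(int)` is an alternative to mapping a lambda over the column and is assumed here only for illustration:

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the notebook
toy = pd.DataFrame({
    'spoiler': [True, False, True],
    'subreddit': ['blackmirror', 'westworld', 'blackmirror'],
})

# Booleans become 1/0
toy['spoiler'] = toy['spoiler'].astype(int)

# True where the post comes from blackmirror, then cast to 1/0
toy['subreddit'] = (toy['subreddit'] == 'blackmirror').astype(int)

print(toy)
```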


In [15]:
series['is_video'].value_counts()

False    11557
True        12
Name: is_video, dtype: int64

In [16]:
series.drop('is_video', inplace = True, axis = 1)

In [17]:
series['spoiler'].value_counts()

False    7392
True     4177
Name: spoiler, dtype: int64

In [18]:
series['spoiler'] = series['spoiler'].astype(int)

In [19]:
series.subreddit = series.subreddit.map(lambda cell: 1 if cell == 'blackmirror' else 0)

In [20]:
series.head()

,created_utc,id,num_comments,score,selftext,spoiler,subreddit,title
3,1545410154,a8bozn,4,1,not sure if i need to mark this as possible sp...,1,1,White Christmas Ending
6,1545373691,a877em,4,1,"Correct me if I'm wrong, but Bandersnatch woul...",1,1,How could Bandersnatch's unique choice-based f...
8,1545369058,a86mla,5,1,The purpose of the CP reveal is not there to m...,0,1,A rant about peoples' reactions to SUAD
12,1545362860,a85vmm,5,1,Mine was The Entire History of You. The techno...,0,1,Which episode got you hooked on Black Mirror?
13,1545354878,a84uyn,9,1,For me it wasn’t surprising because I already ...,1,1,Was Shut Up and Dance not surprising to anyone...


In [21]:
series.subreddit.value_counts(normalize = True)

0    0.509119
1    0.490881
Name: subreddit, dtype: float64

In [22]:
# Drop posts whose body is only the placeholder '[removed]' or '[deleted]'
series = series[series.selftext != "[removed]"]
series = series[series.selftext != "[deleted]"]

In [23]:
series = series.reset_index(drop=True)
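
Filtering leaves gaps in the index, and concatenating two frames can repeat labels, so the index is reset before any row is addressed by number. A small sketch with made-up rows (hypothetical data), using `isin` as a compact alternative to two separate comparisons:

```python
import pandas as pd

# Hypothetical toy rows, two of them placeholders
toy = pd.DataFrame({
    'selftext': ['a real post', '[removed]', 'another post', '[deleted]'],
    'subreddit': [1, 1, 0, 0],
})

# Keep every row whose text is not one of the placeholders
placeholders = ['[removed]', '[deleted]']
kept = toy[~toy.selftext.isin(placeholders)]
labels_before = kept.index.tolist()   # surviving labels have gaps

# Renumber the rows 0..n-1 and discard the old labels
kept = kept.reset_index(drop=True)
labels_after = kept.index.tolist()

print(labels_before, labels_after)
```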

In [24]:
series.head()

,created_utc,id,num_comments,score,selftext,spoiler,subreddit,title
0,1545410154,a8bozn,4,1,not sure if i need to mark this as possible sp...,1,1,White Christmas Ending
1,1545373691,a877em,4,1,"Correct me if I'm wrong, but Bandersnatch woul...",1,1,How could Bandersnatch's unique choice-based f...
2,1545369058,a86mla,5,1,The purpose of the CP reveal is not there to m...,0,1,A rant about peoples' reactions to SUAD
3,1545362860,a85vmm,5,1,Mine was The Entire History of You. The techno...,0,1,Which episode got you hooked on Black Mirror?
4,1545354878,a84uyn,9,1,For me it wasn’t surprising because I already ...,1,1,Was Shut Up and Dance not surprising to anyone...


## Analysis

In the analysis section, I wrote a function called 'prep_post'. It strips HTML markup, removes punctuation and English stop words, and finally stems the words.

### Preparation Function

In [27]:
def prep_post(raw_post, column):
    # Clean the text in raw_post[column] in place.
    # raw_post is a DataFrame whose index runs 0..n-1 (it was reset above);
    # each cell of the column is replaced by a string of lowercase,
    # stop-word-free, stemmed words. Nothing is returned.
    #
    # Build the stop-word set and the stemmer once, outside the loop.
    # In Python, searching a set is much faster than searching a list.
    stops = set(stopwords.words('english'))
    p_stemmer = PorterStemmer()
    for i in range(len(raw_post)):
        # 1. Remove HTML (the parser is named explicitly)
        review_text = BeautifulSoup(raw_post.loc[i, column], 'html.parser').get_text()
        # 2. Remove non-letters
        letters_only = re.sub("[^a-zA-Z]", " ", review_text)
        # 3. Convert to lower case, split into individual words
        words = letters_only.lower().split()
        # 4. Remove stop words
        meaningful_words = [w for w in words if w not in stops]
        # 5. Stemming
        stem_post = [p_stemmer.stem(w) for w in meaningful_words]
        # 6. Join the words back into one string separated by spaces and
        #    write it back with .loc, which assigns to the DataFrame itself
        #    rather than to a possible temporary copy.
        raw_post.loc[i, column] = " ".join(stem_post)
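
The same steps can also be applied to one string at a time and mapped over a column with `Series.apply`, which avoids writing to individual cells. This is a sketch rather than the function used above: `clean_text`, the toy frame, and the short `STOPS` set are hypothetical (the set stands in for `stopwords.words('english')` so the example runs without downloading the NLTK corpus), and the parser is set to `'html.parser'` explicitly.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stop-word list; the notebook uses the NLTK English list
STOPS = {'the', 'was', 'not', 'to', 'a', 'of', 'and', 'is'}
STEMMER = PorterStemmer()

def clean_text(text):
    # Strip HTML, keep letters only, lowercase, drop stop words, stem
    plain = BeautifulSoup(text, 'html.parser').get_text()
    letters_only = re.sub('[^a-zA-Z]', ' ', plain)
    words = letters_only.lower().split()
    return ' '.join(STEMMER.stem(w) for w in words if w not in STOPS)

# Hypothetical one-row frame
toy = pd.DataFrame({'selftext': ['<p>The ending was NOT surprising!</p>']})
toy['selftext'] = toy['selftext'].apply(clean_text)
print(toy.loc[0, 'selftext'])
```

Because `apply` returns a new Series that is assigned to the whole column at once, this form does not depend on the index running from 0 to n-1.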
    

This function is run on two columns:
- selftext
- title

I clean both columns because I want to compare how well the models predict from the post text and from the titles.

Finally, the series dataframe is saved to a CSV file called 'series.csv'.

In [60]:
prep_post(series, 'selftext')
prep_post(series, 'title')
# prep_post(X_test, 'selftext')
# prep_post(X_test, 'title')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

"https://play.google.com/music/m/Bd44qbfo64bal2ywbyq424vrb4m?t=Westworld_Season_2_Music_from_the_HBO_Series_-_Ramin_Djawadi" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

"https://imgur.com/QTI07eN" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

In [61]:
series.to_csv('series.csv')
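
`to_csv` writes the index as an unnamed first column by default, and when such a file is read back that column appears as 'Unnamed: 0', like the one dropped at the start of this notebook. A minimal sketch, using an in-memory buffer and made-up rows instead of the real file, of how `index=False` avoids that:

```python
import io

import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({'title': ['post a', 'post b'], 'subreddit': [1, 0]})

# Default: the index is written as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```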