# Master Project

This notebook contains all the code used for my Master's project.

# Data collection

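I collected the tweets with the `twitterscraper` command-line tool, searching for `volkswagen` with a begin date of 2015-09-01 and an end date of 2015-10-15, using the following query.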

```
twitterscraper volkswagen -bd 2015-09-01 -ed 2015-10-15 -o 01_09_2015-15_10_2015.json
```

In [16]:
import re
import json

import pandas as pd

from nltk.tokenize import WordPunctTokenizer

# Data cleaning

Let's open the raw data.

In [2]:
%%time

with open('01_09_2015-15_10_2015.json', 'r') as file:
    raw = json.load(file)

df = pd.DataFrame(raw)

CPU times: user 30.9 s, sys: 43.2 s, total: 1min 14s
Wall time: 2min 8s


We have collected **914 274** tweets.

In [3]:
print('Number of tweets collected: {0}'.format(len(raw)))

Number of tweets collected: 914274


A tweet looks like this.

In [4]:
raw[0]

{'fullname': 'Steve Kelly',
 'html': '<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en">Out here with my twin! <a class="twitter-atreply pretty-link js-nav" data-mentioned-user-id="61401350" dir="ltr" href="/Big_Euro"><s>@</s><b>big_euro</b></a> <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/socaleuro?src=hash"><s>#</s><b>socaleuro</b></a> <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Volkswagen?src=hash"><s>#</s><b><strong>Volkswagen</strong></b></a> <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/b7passat?src=hash"><s>#</s><b>b7passat</b></a> <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/passat?src=hash"><s>#</s><b>passat</b></a> <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" hr

For the analysis, we keep only the `id` and `text` columns. Later on, likes, replies and retweets could be used to weight tweets by importance.

In [5]:
df.drop(columns=['fullname', 'html', 'likes', 'replies', 'retweets', 'timestamp', 'url', 'user'], inplace=True)
df.head()

|   | id | text |
|---|----|------|
| 0 | 640675797720801280 | Out here with my twin! @big_euro #socaleuro #V... |
| 1 | 640675779253280768 | I liked a @YouTube video http://youtu.be/pmzZb... |
| 2 | 640675589251330048 | Triste que ahora estacionen un Volkswagen que ... |
| 3 | 640675540794515456 | I liked a @YouTube video http://youtu.be/D-He3... |
| 4 | 640675381922656256 | I liked a @YouTube video http://youtu.be/8ZhGp... |


We remove the links, both `http(s)://` URLs and `pic.twitter.com` image links.

In [6]:
print('Original:')
print(df['text'][1])
print(df['text'][18])
print('Transformed:')
# The dot in "pic.twitter" is escaped so that it only matches a literal dot
print(re.sub(r'(https?:\/\/|pic\.twitter)[^\s]+', '', df['text'][1]))
print(re.sub(r'(https?:\/\/|pic\.twitter)[^\s]+', '', df['text'][18]))

Original:
I liked a @YouTube video http://youtu.be/pmzZbUioFAQ?a  2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen pic.twitter.com/hPnvGJ3Ysi
Transformed:
I liked a @YouTube video   2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Volkswagen 
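
The link removal can be isolated in a small helper so that it is easy to test on its own. This is a minimal sketch, not part of the original notebook: the names `LINK_PATTERN` and `strip_links` are hypothetical, and the pattern is the one used above with the dot in `pic.twitter` escaped.

```python
import re

# Hypothetical names for illustration; same pattern as above, with the dot escaped.
LINK_PATTERN = re.compile(r'(https?://|pic\.twitter)\S+')


def strip_links(text):
    """Remove http(s) URLs and pic.twitter links from a single tweet."""
    return LINK_PATTERN.sub('', text)


samples = [
    'I liked a @YouTube video http://youtu.be/pmzZbUioFAQ?a  2015 Volkswagen Sales Event',
    'Volkswagen pic.twitter.com/hPnvGJ3Ysi',
]
for sample in samples:
    # repr() makes the whitespace left behind by the removed links visible
    print(repr(strip_links(sample)))
```

Note that the surrounding spaces are left in place; they are collapsed later in the cleaning.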


We strip the `@` and `#` markers from user mentions and hashtags but keep the text that follows them.

In [7]:
print('Original:')
print(df['text'][0])
print('Transformed:')
print(re.sub(r'(#|@)', '', df['text'][0]))

Original:
Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat #passat #static #bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv
Transformed:
Out here with my twin! big_euro socaleuro Volkswagen b7passat passat static bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv
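
As a minimal sketch of this step (the helper name `strip_markers` is hypothetical and not in the original notebook), the substitution can be wrapped in a function. It also shows a side effect worth knowing about: any `#` or `@` is removed, including ones that are not hashtags or mentions, such as the `#4` in a video title.

```python
import re


def strip_markers(text):
    """Drop '#' and '@' characters but keep the word that follows them."""
    return re.sub(r'[#@]', '', text)


print(strip_markers('Out here with my twin! @big_euro #socaleuro #Volkswagen'))
print(strip_markers("Diesel Old Wives' Tale #4: Stinky"))
```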


Let's write a function that combines all the cleaning steps.

In [25]:
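def cleaner(df):
    """Return a copy of df with a cleaned `text` column."""
    # Work on a copy so the original DataFrame is left untouched
    df_tmp = df.copy()

    # Remove links (http(s) URLs and pic.twitter links).
    # regex=True is passed explicitly: recent pandas versions treat the
    # pattern as a literal string by default.
    df_tmp.text = df_tmp.text.str.replace(r'(https?:\/\/|pic\.twitter)[^\s]+', '', regex=True)

    # Remove the @ and # markers but keep the text that follows them
    df_tmp.text = df_tmp.text.str.replace(r'(#|@)', '', regex=True)

    # Remove the numbers
    df_tmp.text = df_tmp.text.str.replace(r'\d+', '', regex=True)

    # Put everything to lower case
    df_tmp.text = df_tmp.text.str.lower()

    # Replace punctuation with a space (underscores count as word characters and are kept)
    df_tmp.text = df_tmp.text.str.replace(r'[^\w\s]', ' ', regex=True)

    # Collapse runs of whitespace into a single space
    # (a single leading or trailing space is not stripped)
    df_tmp.text = df_tmp.text.str.replace(r'\s{2,}', ' ', regex=True)

    return df_tmp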
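
To check the steps on individual tweets without loading the whole dataset, the same sequence can be mirrored on a single string. This is only a sketch under that assumption: `clean_tweet` is a hypothetical helper, not part of the notebook, and it uses the standard `re` module in place of the pandas string methods.

```python
import re


def clean_tweet(text):
    """Hypothetical per-string mirror of the cleaning steps, in the same order."""
    text = re.sub(r'(https?://|pic\.twitter)\S+', '', text)  # links
    text = re.sub(r'[#@]', '', text)                         # mention / hashtag markers
    text = re.sub(r'\d+', '', text)                          # numbers
    text = text.lower()                                      # lower case
    text = re.sub(r'[^\w\s]', ' ', text)                     # punctuation -> space
    text = re.sub(r'\s{2,}', ' ', text)                      # collapse whitespace runs
    return text


tweet = ('Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat '
         'http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv')
# repr() shows that a single trailing space survives the last step
print(repr(clean_tweet(tweet)))
```

Such a function could also be applied to the column with `df['text'].map(clean_tweet)`, although the vectorised string methods are the more usual choice on a DataFrame of this size.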

In [26]:
df_clean = cleaner(df)

In [27]:
for i in range(30):
    print(df.text[i])

Out here with my twin! @big_euro #socaleuro #Volkswagen #b7passat #passat #static #bagride… http://ift.tt/1JZMPmk pic.twitter.com/mVowUAs7Xv
I liked a @YouTube video http://youtu.be/pmzZbUioFAQ?a  2015 Volkswagen Sales Event | “Model Rear End” Passat Commercial
Triste que ahora estacionen un Volkswagen que no es mio frente a mi casa.
I liked a @YouTube video http://youtu.be/D-He3Gq43DA?a  Diesel Old Wives' Tale #4: Stinky | 2015 Volkswagen Passat TDI Clean Diesel
I liked a @YouTube video http://youtu.be/8ZhGpZaLC10?a  Diesel Old Wives' Tale #2: Loud | 2015 Volkswagen Passat TDI Clean Diesel
I liked a @YouTube video http://youtu.be/YnYuCVhRgK8?a  2015 Volkswagen Sales Event | “Rear” Passat Commercial
04 Volkswagen Jetta: '04 Volkswagen Jetta in good condition with only 76k miles. 4 door sedan ... http://bit.ly/1Op01oK  #VW #jetta #car
2015 Volkswagen Jetta Sedan: Looking for a 2015 Volkswagen Jetta Sedan located in Merrimack NH... http://bit.ly/1XzoqhH  #VW #jetta #car
I liked a @YouTub

In [28]:
for i in range(30):
    print(df_clean.text[i])

out here with my twin big_euro socaleuro volkswagen bpassat passat static bagride 
i liked a youtube video volkswagen sales event model rear end passat commercial
triste que ahora estacionen un volkswagen que no es mio frente a mi casa 
i liked a youtube video diesel old wives tale stinky volkswagen passat tdi clean diesel
i liked a youtube video diesel old wives tale loud volkswagen passat tdi clean diesel
i liked a youtube video volkswagen sales event rear passat commercial
 volkswagen jetta volkswagen jetta in good condition with only k miles door sedan vw jetta car
 volkswagen jetta sedan looking for a volkswagen jetta sedan located in merrimack nh vw jetta car
i liked a youtube video diesel old wives tale hard to find volkswagen passat tdi clean diesel
i liked a youtube video volkswagen sales event many miles passat commercial
hoje eu to feliiiiiz hoje eu to contente o o fusca vw vwaircooled käfer volkswagen car instaca 
volkswagen fixa preço de revisões incluindo mão de obra carr