In [15]:
import csv 
import pandas as pd
import nltk
import os
import warnings
warnings.filterwarnings('ignore')

Now we want to preprocess the tweets according to their language.

In [7]:
#define the languages we want to preprocess
lang_list = ['en', 'es', 'pt']

# map each ISO 639-1 code to the language name used by NLTK
lang_dict = {
    'en': 'english',
    'es': 'spanish',
    'pt': 'portuguese'
}

clean_data_dir = './clean_data'
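
Since the same steps will later apply to every language in `lang_list`, the input paths can be built in one place. The following is only a sketch: the helper name `csv_path_for` is hypothetical, and it assumes each language has a file named `<code>.csv` inside `clean_data_dir`.

```python
import os

lang_list = ['en', 'es', 'pt']
clean_data_dir = './clean_data'


def csv_path_for(lang, data_dir=clean_data_dir):
    """Hypothetical helper: return the path <data_dir>/<lang>.csv."""
    return os.path.join(data_dir, lang + '.csv')


# one path per language code; assumes the files follow the <code>.csv naming
paths = {lang: csv_path_for(lang) for lang in lang_list}
for lang, path in paths.items():
    print(lang, path)
```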

We will preprocess English first. Let's load the CSV file into a pandas DataFrame.

In [11]:
fulldir = os.path.join(clean_data_dir, 'en.csv')

df_eng = pd.read_csv(fulldir)
df_eng

Unnamed: 0.1,Unnamed: 0,clean_tweets,prediction,language,confidence
0,3,I felt my first flash of violence at some fool...,"('en', 0.9772645831108093)",en,0.977265
1,4,Ladies drink and get in free till,"('en', 0.6527988910675049)",en,0.652799
2,7,Watching Miranda On bbc mermhart u r HILARIOUS,"('en', 0.5819909572601318)",en,0.581991
3,9,Shopping Kohls httptcoIZkQHT,"('en', 0.5320528745651245)",en,0.532053
4,16,Dennycrowe all over twitter because you and yo...,"('en', 0.768022358417511)",en,0.768022
...,...,...,...,...,...
4307,10492,Another Cardigan Records Hopscotch Day Party i...,"('en', 0.920673131942749)",en,0.920673
4308,10493,Im at Hempstead Hair World in Elmont NY httpst...,"('en', 0.5241799354553223)",en,0.524180
4309,10494,Bachelorette Laurita Winery httpstcoBsIIFmdGz,"('en', 0.6609522104263306)",en,0.660952
4310,10496,This job might be a great fit for you Sr Infor...,"('en', 0.6981475949287415)",en,0.698148


Since we only need the `clean_tweets` column, we can drop the others.

In [12]:
# take an explicit copy so that adding columns later does not write to a view of df_eng
df_eng_clean_tweets = df_eng[['clean_tweets']].copy()
df_eng_clean_tweets

Unnamed: 0,clean_tweets
0,I felt my first flash of violence at some fool...
1,Ladies drink and get in free till
2,Watching Miranda On bbc mermhart u r HILARIOUS
3,Shopping Kohls httptcoIZkQHT
4,Dennycrowe all over twitter because you and yo...
...,...
4307,Another Cardigan Records Hopscotch Day Party i...
4308,Im at Hempstead Hair World in Elmont NY httpst...
4309,Bachelorette Laurita Winery httpstcoBsIIFmdGz
4310,This job might be a great fit for you Sr Infor...


First, we lowercase everything.

In [16]:
df_eng_clean_tweets['lowercase'] = df_eng_clean_tweets['clean_tweets'].str.lower()
df_eng_clean_tweets

Unnamed: 0,clean_tweets,lowercase
0,I felt my first flash of violence at some fool...,i felt my first flash of violence at some fool...
1,Ladies drink and get in free till,ladies drink and get in free till
2,Watching Miranda On bbc mermhart u r HILARIOUS,watching miranda on bbc mermhart u r hilarious
3,Shopping Kohls httptcoIZkQHT,shopping kohls httptcoizkqht
4,Dennycrowe all over twitter because you and yo...,dennycrowe all over twitter because you and yo...
...,...,...
4307,Another Cardigan Records Hopscotch Day Party i...,another cardigan records hopscotch day party i...
4308,Im at Hempstead Hair World in Elmont NY httpst...,im at hempstead hair world in elmont ny httpst...
4309,Bachelorette Laurita Winery httpstcoBsIIFmdGz,bachelorette laurita winery httpstcobsiifmdgz
4310,This job might be a great fit for you Sr Infor...,this job might be a great fit for you sr infor...


Then we remove the stopwords.

In [23]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# a set gives constant-time membership tests, unlike the list NLTK returns
stop_words = set(stopwords.words(lang_dict['en']))

df_eng_clean_tweets['stopwords_removed'] = df_eng_clean_tweets['lowercase'].apply(
    lambda x: ' '.join(item for item in x.split() if item not in stop_words)
)
df_eng_clean_tweets

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bingyuyap/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,clean_tweets,lowercase,stopwords_removed
0,I felt my first flash of violence at some fool...,i felt my first flash of violence at some fool...,felt first flash violence fool bumped pity fool
1,Ladies drink and get in free till,ladies drink and get in free till,ladies drink get free till
2,Watching Miranda On bbc mermhart u r HILARIOUS,watching miranda on bbc mermhart u r hilarious,watching miranda bbc mermhart u r hilarious
3,Shopping Kohls httptcoIZkQHT,shopping kohls httptcoizkqht,shopping kohls httptcoizkqht
4,Dennycrowe all over twitter because you and yo...,dennycrowe all over twitter because you and yo...,dennycrowe twitter friends cant stick
...,...,...,...
4307,Another Cardigan Records Hopscotch Day Party i...,another cardigan records hopscotch day party i...,another cardigan records hopscotch day party b...
4308,Im at Hempstead Hair World in Elmont NY httpst...,im at hempstead hair world in elmont ny httpst...,im hempstead hair world elmont ny httpstcohvyg...
4309,Bachelorette Laurita Winery httpstcoBsIIFmdGz,bachelorette laurita winery httpstcobsiifmdgz,bachelorette laurita winery httpstcobsiifmdgz
4310,This job might be a great fit for you Sr Infor...,this job might be a great fit for you sr infor...,job might great fit sr information architect s...
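
Note that tokens such as `cant` and `im` survive in the output above. The tweets had their punctuation stripped earlier, so a contraction stored in the stop list with an apostrophe (for example `don't`) no longer matches its apostrophe-free form. One possible workaround, sketched below under assumptions, is to extend the stop list with apostrophe-stripped variants. The helper names and the small sample list are hypothetical stand-ins for `stopwords.words('english')`.

```python
def with_apostrophe_free_variants(words):
    """Return the words plus variants with apostrophes removed ("don't" -> "dont")."""
    return set(words) | {w.replace("'", "") for w in words}


def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


# hypothetical sample list, standing in for the real NLTK stop list
sample_stop_words = ["don't", "i'm", "at", "in", "and", "you"]
expanded = with_apostrophe_free_variants(sample_stop_words)

print(remove_stopwords("im at hempstead hair world in elmont ny", expanded))
```

Whether this is desirable depends on the data: stripping apostrophes can also create collisions with real words (for example `we'll` becomes `well`), so the expanded list is worth inspecting before use.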


I will use a lemmatizer instead of a stemmer because a stemmer can produce truncated, misspelled-looking forms, which leads to several different tokens for what should be the same word. However, I don't think this scales as we widen the set of languages, because some languages may have no lemmatizer, or no good one. That introduces the need for language-specific preprocessing.

Considerations:
1. Speed - a lemmatizer is usually slower than a stemmer, but it does a better job of recovering the actual word and meaning of each token. This is a tradeoff between speed and quality.
2. Preprocess all the data sources fairly - I want a pipeline that preprocesses all languages the same way. For the purposes of this assignment, the spaCy lemmatizer supports all three languages just fine. However, when scaling to other languages, we need to consider preprocessing each language differently.
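
One way to keep a single pipeline while still allowing per-language differences is a small registry keyed by language code. The sketch below is only an illustration: the `PIPELINES` dictionary and `get_pipeline_config` are hypothetical names, and it assumes the small spaCy models `es_core_news_sm` and `pt_core_news_sm` would be used for Spanish and Portuguese alongside `en_core_web_sm`.

```python
# Hypothetical registry: ISO 639-1 code -> resources used to preprocess that language.
PIPELINES = {
    "en": {"stopwords": "english", "spacy_model": "en_core_web_sm"},
    "es": {"stopwords": "spanish", "spacy_model": "es_core_news_sm"},
    "pt": {"stopwords": "portuguese", "spacy_model": "pt_core_news_sm"},
}


def get_pipeline_config(lang):
    """Look up the resources for a language, failing loudly if none are registered."""
    try:
        return PIPELINES[lang]
    except KeyError:
        raise ValueError(f"no preprocessing pipeline registered for {lang!r}") from None


print(get_pipeline_config("es")["spacy_model"])
```

A language without a usable lemmatizer could then register a different strategy (for instance a stemmer) without changing the code that drives the pipeline.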

In [35]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
     |████████████████████████████████| 13.6 MB 2.2 MB/s eta 0:00:01
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
You should consider upgrading via the '/opt/homebrew/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [37]:
# pip install spacy
import spacy

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
df_eng_clean_tweets['lemmatized'] = df_eng_clean_tweets['stopwords_removed'].apply(lambda x: " ".join([token.lemma_ for token in nlp(x)]))
df_eng_clean_tweets.sample(5)

Unnamed: 0,clean_tweets,lowercase,stopwords_removed,lemmatized
778,TheOnlySarahh yall did great tonight,theonlysarahh yall did great tonight,theonlysarahh yall great tonight,theonlysarahh y all great tonight
3885,Retail Job in SanCarlos CA store manager San ...,retail job in sancarlos ca store manager san ...,retail job sancarlos ca store manager san carl...,retail job sancarlo ca store manager san carlo...
2107,raeboze ImVeryIndian kenziehoneybutt best nigh...,raeboze imveryindian kenziehoneybutt best nigh...,raeboze imveryindian kenziehoneybutt best nigh...,raeboze imveryindian kenziehoneybutt good nigh...
497,Football talk with the boys gtgtgtgtgt,football talk with the boys gtgtgtgtgt,football talk boys gtgtgtgtgt,football talk boy gtgtgtgtgt
3033,Melbourne Demons Training Victory as well Go...,melbourne demons training victory as well go...,melbourne demons training victory well goschs ...,melbourne demon train victory well goschs padd...


After lemmatizing the tweets, we can vectorize the tokens.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

n_features = 1000

# Use tf-idf features for NMF & SVD.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=5, max_features=n_features, stop_words='english', smooth_idf=True)
tfidf = tfidf_vectorizer.fit_transform(df_eng_clean_tweets['lemmatized'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=5,  max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(df_eng_clean_tweets['lemmatized'])
tf_feature_names = tf_vectorizer.get_feature_names()

print(tfidf.shape) # check shape of the document-term matrix

ModuleNotFoundError: No module named 'sklearn'
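
To make the tf-idf weighting concrete without scikit-learn installed, the smoothed inverse document frequency that `smooth_idf=True` asks for can be sketched by hand. This follows the formula documented for scikit-learn, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The three toy documents are made up for illustration, and the sketch ignores the `min_df`, `max_df` and `max_features` filtering used above.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed variant."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# toy, already-tokenised documents (illustrative only)
docs = [
    ["football", "talk", "boy"],
    ["retail", "job", "store"],
    ["great", "job", "tonight"],
]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: smoothed_idf(n_docs, df[term]) for term in vocab}


def tfidf_row(doc):
    """Raw term count times idf, then scale the row to unit L2 norm."""
    weights = {term: doc.count(term) * idf[term] for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(idf["job"], idf["football"])
print(tfidf_row(docs[1]))
```

Terms that occur in fewer documents (such as `football` here) receive a larger idf than terms shared across documents (such as `job`), which is why tf-idf features are used for NMF while LDA works on the raw counts.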

In [47]:
# use the %pip magic: a bare `pip install` line in a multi-line cell is parsed as Python and fails
%pip install --no-cache --no-binary :all: --no-use-pep517 "scipy==1.7.1"
%pip install --no-use-pep517 "scikit-learn==0.24.2"