# Part 3 - Classifying Tweets as spam or ham
## Comments
In this section we filter our tweets. To do so, we train a model that classifies each tweet as spam or ham.
Details of the model are given below.

### Data
For the classifier to be effective, we needed well-labeled data. We first looked at the quality of freely available sources, for example SPAM base on Kaggle and SMS spam on the UCI Machine Learning Repository. However, the results weren't good enough: the free data was neither clean enough nor specific enough to tweets, which led to poor classification results.
To improve our spam vs ham classifier we needed labels that were specific to our own tweets.
Hand-labeling the tweets would be specific enough to lead to good classification results, but it is a tedious process, and we needed a sufficiently large set of labeled data. Our remedy was to design an algorithm that labels the data for us. Since Twitter already does a fairly good job of eliminating spam tweets, there was no need for an extremely accurate algorithm.

### Labeled data
To do so, we first created a list of spam words that helps to check whether a tweet is spam. Then we gathered tweets from various sectors. With that data, we developed an algorithm that goes through each tweet and checks whether any of the spam words occurs in it; if so, the tweet is labeled as spam.
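
The labeling rule described above can be sketched as follows. This is only a minimal illustration: the word list `SPAM_WORDS` and the helper `label_tweet` are hypothetical names, and the project's actual list of spam words is not reproduced here.

```python
import re

# Hypothetical spam word list, for illustration only.
SPAM_WORDS = {"giveaway", "airdrop", "free", "winner"}

def label_tweet(text, spam_words=SPAM_WORDS):
    """Return 'spam' if any spam word occurs in the tweet, else 'ham'."""
    # Lowercase and split into simple word tokens before matching.
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return "spam" if tokens & spam_words else "ham"

tweets = [
    "Join our FREE airdrop now!",
    "Quarterly results were better than expected.",
]
labels = [label_tweet(t) for t in tweets]
print(labels)
```

Matching on whole tokens rather than substrings avoids flagging a word like "freedom" because it contains "free"; whether that is desirable depends on the word list.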

Overall, our algorithm isn't perfect: a more fine-grained list of spam words would give better labels, but it performs well enough for the purpose of this project.

### Classification
Further below in this section we show how we train our spam classifier on the labeled data produced by the previous algorithm,
and then how we classify our freshly gathered clean tweets.

Finally, we'll save our ham tweets in this file:
- **twitter_data_clean_ham.csv** : contained in */7-Data/2-CleanTweets/*

## Libraries

In [9]:
# Libraries
import pandas as pd
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
random.seed(1)
np.random.seed(1)  # the train/test mask in train_classifier is drawn with NumPy's RNG

## Supervised Tweets Classification

In Natural Language Processing (NLP), classifying text requires a way to transform the text into a form that the machine can analyze. This step is called feature extraction. Then we need to tell the machine which examples are spam and which are ham, so that it can learn how to separate them into two groups. This step is called classification.

In our algorithm we use two feature extractors. The first vectorizes the text into a count of each word; these counts help the machine pick up on words that recur in spam, for example. Then we apply a tf-idf transformer. Tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus; in our case the document is the tweet. For example, "Viagra" will carry more weight than "kicking".
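
To make the two feature extractors concrete, here is a small sketch on three made-up documents (the texts are assumptions chosen for illustration, not project data). It shows the raw counts and that, within a document, a word seen in fewer documents receives a higher tf-idf weight than an equally frequent but more common word.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, for illustration only.
docs = ["cheap viagra cheap pills", "kicking the ball", "the ball is cheap"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)            # sparse matrix: documents x vocabulary
tfidf = TfidfTransformer().fit_transform(counts)

vocab = vect.vocabulary_                      # maps each word to its column index
print(counts.shape)
print(counts[0, vocab["cheap"]])              # "cheap" occurs twice in the first document

# In the third document, "is" appears in one document overall and "the" in two,
# so "is" gets the larger tf-idf weight even though both occur once.
print(tfidf[2, vocab["is"]] > tfidf[2, vocab["the"]])
```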

We now train our classifier on a training subset of our labeled data, and then apply it to the fresh tweets we gathered.
For efficiency and readability we use a scikit-learn pipeline to train our classifier. More information is available <a href="http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">**here**</a>. In our case the pipeline chains successive steps on the same data: a *Count Vectorizer*, then a *tf-idf Transformer*, and finally a *Multinomial Naïve Bayes* classifier.

We also wrote a more detailed walkthrough that helps the user understand each step of the pipeline. It is available in the file **training_classifier_details.ipynb** in folder */3-Spam-Filter/*

As depicted in the figure above, we take the data from the labeled dataset (input and label) and feed it to the count vectorizer and the tf-idf transformer (our feature extractors). Once the features have been extracted, we feed them into our multinomial Naive Bayes classifier, which learns to tell whether a tweet is spam or ham.
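
The flow described above can be tried end to end on a handful of examples. The sketch below uses invented texts and labels (not the project's training file) and the same three pipeline steps; the name `toy_cl` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labeled data, for illustration only.
texts = [
    "win free coins now",
    "free airdrop win big",
    "earnings call at noon",
    "new product launch today",
]
labels = ["spam", "spam", "ham", "ham"]

# Same three steps as the project pipeline: counts -> tf-idf -> Naive Bayes.
toy_cl = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_cl.fit(texts, labels)

pred = toy_cl.predict(["free coins airdrop", "product earnings today"])
print(list(pred))
```

Because the pipeline bundles the feature extractors with the classifier, raw strings can be passed to `predict` directly and the same vocabulary learned during `fit` is reused.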

### Main functions of the classifier
#### Why save the model as a pickle file?
- First, because training a model takes time and computational power, so we would rather save it than compute it a second time.
- Second, because such models can be large depending on how they were trained, and the pickle format handles large files well.

The user does not have to re-run the following code and can go directly to the **Using the model to classify** part.
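
As a minimal sketch of the save-and-reload idea, the snippet below fits a tiny pipeline on invented data, writes it to a temporary file, and loads it back. It assumes the standalone `joblib` package is installed; the file name `toy_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in model trained on made-up examples.
model = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
model.fit(["free coins now", "meeting at noon"], ["spam", "ham"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)          # serialize the fitted pipeline to disk
    restored = joblib.load(path)      # load it back without retraining

sample = ["free coins"]
same = list(model.predict(sample)) == list(restored.predict(sample))
print(same)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.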

In [14]:
def train_classifier():
    """Classifier Trainer"""
    
    # We load training data
    data = pd.read_csv("./data/twitter_spam_trainer.csv", encoding="utf-8", header=0)
    
    # We format the columns
    data.columns = ['Text','SpamOrHam']
    
    # Missing values are filled with "spam", so rows without a label are treated as spam
    data = data.fillna("spam")
    
    # We split the data into training and testing data
    msk_data = np.random.rand(len(data)) < 0.8
    train = data[msk_data]
    test = data[~msk_data]
    
    # We define our classifier pipeline
    print('Training classifier...')
    cl = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
    
    # We fit our training data into the scikit pipeline : countvectorizer => tfidf transform => multinomialNB
    cl = cl.fit(train.Text, train.SpamOrHam)    
    print('Classifier trained')
    
    print('Computing score...')
    
    # We print the score
    predicted = cl.predict(test.Text)  
    print('Accuracy: {}'.format(np.mean(predicted == test['SpamOrHam'])))
    
    print('Saving model...')
  
    # saving model
    joblib.dump(cl, './data/twitter_sentiment_model_spam.pkl')
    print('Model saved.') 

## Training the model and testing it
We can see that our model reaches roughly 90% accuracy (0.896) on the held-out test split.

In [15]:
train_classifier()

Training classifier...
Classifier trained
Computing score...
Accuracy: 0.8962002783036016
Saving model...
Model saved.
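
The split inside `train_classifier` draws a random mask, so the test share is only approximately 20% and the class balance can differ between the two subsets. One possible alternative, shown here on invented data as a sketch rather than as the project's method, is scikit-learn's `train_test_split` with stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with the same column names as the training data.
data = pd.DataFrame({
    "Text": [f"tweet {i}" for i in range(10)],
    "SpamOrHam": ["spam", "ham"] * 5,
})

# Exact 80/20 split that keeps the spam/ham ratio in both subsets.
train, test = train_test_split(
    data, test_size=0.2, random_state=1, stratify=data["SpamOrHam"]
)
print(len(train), len(test))
print(sorted(test["SpamOrHam"]))
```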


## Using the model to classify

### Functions

In [21]:
def load_model():
    print("Loading model...")
    # to load back in
    cl = joblib.load('./data/twitter_sentiment_model_spam.pkl')
    print("Done.")
    return cl

def predict_dataset(tweets, cl):
    """Predict if the tweets are spam or not
    
    Arguments:
        tweets {DataFrame} -- tweets, with a 'Processed Text' column
        cl {model pipeline} -- Model that is used to predict spam or ham
    
    Returns:
        DataFrame -- final predicted results
    """

    # Predict dataset
    print('Pedict dataset...')

    # We predict
    predicted = cl.predict(tweets['Processed Text'].values.astype('U'))
    
    # We compound Text and predicted into a single dataframe
    tweets['Predicted'] = predicted

    print("Done.")

    # Return the final dataframe
    return tweets

def create_cloud(tweets):
    """Generate a word cloud that shows the most recurring words.
    
    Arguments:
        tweets {string} -- text of all tweets joined into a single string.
    """

    stopwords = set(STOPWORDS)
    wc = WordCloud(background_color="white",
        max_words=200,
        stopwords=stopwords, 
        width=800, 
        height=400)
    wc.generate(tweets)
    wc.to_file('./data/0-graphs/wordcloud.png')

def load_tweets():
    print("Loading tweets...")
    df = pd.read_csv('./data/2-cleaned-tweets/cleaned_tweets.csv',  encoding='utf-8', index_col=0)
    print("Done.")
    return df  

def save_dataset(df):
    print("Saving dataset...")
    df.to_csv('./data/2-cleaned-tweets/twitter_data_clean_ham.csv')
    print("Done. \nSaved as : twitter_data_clean_ham.csv ")

def clean_tweets(df):
    # We eliminate nan fields
    df = df[pd.notnull(df['Processed Text'])]

    # We eliminate doubles
    df = df.drop_duplicates('Message ID')
    df = df.drop_duplicates('Processed Text')
    return df

### Main 
We run the classification.

In [22]:
# main program
if __name__ == '__main__':

    # train_classifier()

    # We load the model
    model_cl = load_model()

    # We load the tweets
    df_tweets = load_tweets()

    # We gather results.
    results = predict_dataset(df_tweets, model_cl)

    results = clean_tweets(results)

    # Spam Or Ham
    spam = results['Predicted']=="spam"
    ham = results['Predicted']=="ham"

    # We print out the results.
    print("-"*50)
    print('Spam number :{0}\t|\tHam number :{1}'.format(len(results[spam]['Predicted']),len(results[ham]['Predicted'])))  
    print("-"*50)
    
    # We generate a wordcloud of the most recurrent words.
    print("Generating wordcloud...")
    create_cloud(results[ham]['Processed Text'].to_csv(encoding='utf-8', sep=' ', index=False, header=False))
    print("Done. \nSaved as {0}".format('wordcloud.png'))
    
    # We save our dataset as twitter_data_clean_ham.csv
    save_dataset(results[ham])

Loading model...
Done.
Loading tweets...
Done.
Pedict dataset...
Done.
--------------------------------------------------
Spam number :90	|	Ham number :101
--------------------------------------------------
Generating wordcloud...
Done. 
Saved as wordcloud.png
Saving dataset...
Done. 
Saved as : twitter_data_clean_ham.csv 


In [29]:
results[ham].sample(5)  

Unnamed: 0,Company Name,Author Name,Text,Message ID,Published At,Retweet Count,Favorite Count,Processed Text,Predicted
10,SXP,YAM,(完全に描く骨の量を減らしたい描き手の都合も多分に含まれます),1169866657629532160,2019-09-06 06:55:20,0,0,(完全に描く骨の量を減らしたい描き手の都合も多分に含まれます),ham
7,EDO,氷結こんぶ,@t_ogiri2 「南蛮か！」のネタが放送コードにかかった\n【大喜利】江戸時代からタイム...,1169897578239905792,2019-09-06 08:58:12,0,0,「南蛮か！」のネタが放送コードにかかった 【大喜利】江戸時代からタイムスリップしたお侍漫...,ham
8,EDO,KoroGMoraza,"RT @BakartxoR: Hondarribian, @Jaizkibel6-aren ...",1169897453534896132,2019-09-06 08:57:43,19,0,"Hondarribian, desfile entsaioan, konpaini...",ham
8,CRPT,Coin Trading Analytics,"Top 100 avg 1h return: 0.1±1.0%; 55 up, 45 dow...",1169867886887702529,2019-09-06 07:00:13,0,0,"Top 100 avg 1h return: 0.1±1.0%; 55 up, 45 dow...",ham
0,SXP,ぬしえ,@yy109810 ゲレンデが溶けるほど恋したい,1169887761102360576,2019-09-06 08:19:12,0,0,ゲレンデが溶けるほど恋したい,ham


In [28]:
results[spam].sample(5)

Unnamed: 0,Company Name,Author Name,Text,Message ID,Published At,Retweet Count,Favorite Count,Processed Text,Predicted
0,CRPT,Energiwik,@Louna_Crpt @JulieClerissy Si si justement fai...,1169886961508610049,2019-09-06 08:16:01,0,0,Si si justement fais attention à toi 😈,spam
1,DROP,CryptoAdvocate 🇦🇺,Always great to see a new #MCO #VISA Card reci...,1169885979613614080,2019-09-06 08:12:07,1,1,Always great to see a new Card recipient p...,spam
0,NEXO,ᱬ; ꪝᥲꪀdᥲ ℳ (Menciones|Disney! Au)୭̥ೄ,"@Rusbrk ᱬⵓ “ No, no me molesta, depende de com...",1169899045034180610,2019-09-06 09:04:02,0,0,"ᱬⵓ “ No, no me molesta, depende de como lo d...",spam
6,NEXO,MegaNerd,@dynitstaff e @Nexo_Digital continuano a porta...,1169895570380476416,2019-09-06 08:50:14,0,0,e continuano a portare gli nei cinema it...,spam
8,EDO,Om edo,@BuuloloArianto Berkali-kali studi banding ke ...,1169898699901718531,2019-09-06 09:02:40,0,0,Berkali-kali studi banding ke luar negeri hs...,spam


# Part 3 - Conclusion
To conclude, we can see that our algorithm isn't perfect. We realized the importance of good data: machine learning algorithms only perform well when the data is of high quality. Furthermore, we only looked at freely available data and did not consider any paid solution. To improve the accuracy of our algorithm, one solution would be to better understand how spam is structured and to find characteristics that differentiate it from ham. That could be done by adding requirements that a tweet must pass in order to be classified as ham, based for example on **weird characters**, **smileys**, **SMS words**, **better filter for languages**, **etc.**

Our main concern is that we are more **interested** in eliminating **false negatives** than in eliminating **false positives**. In other words, it matters more when a **spam is classified as ham** than the inverse. These **false negatives**, spam tweets kept as ham, would lead to a **bias** in our sentiment analysis.
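
One way to quantify this concern would be to count false negatives explicitly instead of looking at accuracy alone. The sketch below uses invented true and predicted labels (not results from this project) and treats spam as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]

# With labels=["ham", "spam"], rows are true classes and columns are predictions,
# so ravel() yields: true ham, ham flagged as spam, spam kept as ham, true spam.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["ham", "spam"]).ravel()

spam_recall = tp / (tp + fn)   # share of actual spam that was caught
print(tn, fp, fn, tp)
print(spam_recall)
```

A low spam recall means many spam tweets slip through as ham, which is exactly the case that would bias the sentiment analysis.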

# Now the user may go to Part 4 - Sentiment Analysis
File *sentiment_analysis.ipynb* in folder **4-Sentiment**