# Part 2 - Tweets cleaning
## Comments
In this section we clean the tweets: we read them as UTF-8, remove every word containing `#` or `@`, strip the links, and delete a few uninformative words and acronyms (`RT`, `Form`, `Inc`, `App`, `Alerts`).
The processed text is stored in a `Processed Text` column, duplicates are dropped, and the resulting dataframe is saved as a CSV file:
- **cleaned_tweets.csv** : contained in *./data/2-cleaned-tweets/*
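
As a rough illustration of the steps listed above, the sketch below applies them to a made-up tweet. The helper name `clean_example`, the simplified link pattern, and the sample text are assumptions for illustration only; it is not identical to the notebook's `strip_tweet` function defined in the next section.

```python
import re

# Hypothetical helper, for illustration only (not the notebook's strip_tweet)
def clean_example(text):
    # 1. replace links with a space (simplified pattern: up to the next whitespace)
    text = re.sub(r'https?://\S+', ' ', text)
    # 2. drop every word that contains @ or #
    words = [w for w in text.split() if '@' not in w and '#' not in w]
    # 3. drop a few uninformative tokens when they appear as separate words
    words = [w for w in words if w not in {'RT', 'Form', 'Inc', 'App', 'Alerts'}]
    return ' '.join(words)

# made-up sample tweet
sample = 'RT @someone: Bitcoin is up today #BTC https://example.com/news'
cleaned = clean_example(sample)
print(cleaned)  # Bitcoin is up today
```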

## Libraries

In [21]:
# Our main libraries
import numpy as np
import pandas as pd
import sys
import re, string
from os import listdir
from os.path import isfile, join

## Main functions

In [22]:
# We Strip links and entities
def strip_tweet(text):

    # find all links
    # (raw string, so that \( and \) are not treated as invalid string escapes)
    links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    # for each link found
    for link in links:

        # replace the link with a space
        text = text.replace(link, ' ')

    # entity prefixes
    entity_prefixes = ['@','#']
    # split on whitespace (this also collapses line breaks) and keep only
    # the words that contain neither @ nor #
    words = [word for word in text.split()
             if not any(prefix in word for prefix in entity_prefixes)]
    text = ' '.join(words)
    
    # Remove various unimportant tokens. They are matched as whole words
    # only, so that e.g. 'App' is not cut out of 'Apple':
    #   RT     : means Retweeted
    #   Form   : the text Form
    #   Inc    : we already know we are talking about corporations
    #   App    : not considered important
    #   Alerts : not considered important
    text = re.sub(r'\b(?:RT|Form|Inc|App|Alerts)\b', '', text)
    
    # return the processed text
    return text
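
One detail worth checking when removing short tokens such as `App` or `Inc`: plain `str.replace` also matches inside longer words, whereas a regular expression with `\b` word boundaries only matches the token when it stands on its own. The sketch below compares both approaches on a made-up sentence; the variable names and the sample are illustrative assumptions.

```python
import re

# made-up sample text
sample = 'RT Apple Inc launches a new App'

# plain substring replacement: also hits 'App' inside 'Apple'
naive = sample
for token in ['RT', 'Form', 'Inc', 'App', 'Alerts']:
    naive = naive.replace(token, '')

# whole-word replacement using \b word boundaries
whole_word = re.sub(r'\b(?:RT|Form|Inc|App|Alerts)\b', '', sample)

# normalise whitespace to make the two results easy to compare
naive = ' '.join(naive.split())
whole_word = ' '.join(whole_word.split())

print(naive)       # le launches a new
print(whole_word)  # Apple launches a new
```

For a dataset about companies, the whole-word variant is probably the safer choice, since names like "Apple" would otherwise be damaged.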

## We load our companies dataset

In [23]:
# load crypto
crypto = pd.read_csv('./data/crypto.csv', encoding="utf-8")

# discover tweet files
datapath='./data/1-raw-tweets/'
files = [f for f in listdir(datapath) if isfile(join(datapath, f))]

# load tweets
dfs = []

# We go through all the tweets data files in our folder.
for f in files:
    
    # Shows which file is currently processed
    print('Loading {}'.format(f))
    
    # Full filepath
    full_filepath = '{0}/{1}'.format(datapath, f)
    
    # We try to read the files
    try:
        df = pd.read_csv(full_filepath, encoding="utf-8", index_col=0, engine='python')
        
        print(full_filepath)
        
        df['Processed Text'] = df['Text'].apply(strip_tweet)
        
        # We eliminate duplicate tweets (within this file)
        df = df.drop_duplicates(subset='Processed Text')
        
        dfs.append(df)
    except Exception as e:
        # report the failure and its cause, then continue with the other files
        print('Failed to load {}: {}'.format(f, e))

# We concatenate all tweets into one dataframe
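# sort=True keeps the alphabetical column order explicitly and silences
# the pandas warning about the changing default
df = pd.concat(dfs, sort=True)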

# We check.
print(df.head())

# Shows that it worked.
print('Files Loaded.')

# We save the combined, cleaned tweets dataframe into a single csv file
df.to_csv('./data/2-cleaned-tweets/cleaned_tweets.csv', header=True, encoding="utf-8")

print("File Saved.")

print("Done.")

Loading twitter_data_1567761122.csv
./data/1-raw-tweets//twitter_data_1567761122.csv
Loading twitter_data_1567761063.csv
./data/1-raw-tweets//twitter_data_1567761063.csv
Loading twitter_data_1567871525.csv
./data/1-raw-tweets//twitter_data_1567871525.csv
Loading twitter_data_1567761103.csv
./data/1-raw-tweets//twitter_data_1567761103.csv
Loading twitter_data_1567761153.csv
./data/1-raw-tweets//twitter_data_1567761153.csv
Loading twitter_data_1567760999.csv
./data/1-raw-tweets//twitter_data_1567760999.csv
Loading twitter_data_1567761359.csv
./data/1-raw-tweets//twitter_data_1567761359.csv
Loading twitter_data_1567760411.csv
./data/1-raw-tweets//twitter_data_1567760411.csv
Loading twitter_data_1567760665.csv
./data/1-raw-tweets//twitter_data_1567760665.csv
Loading twitter_data_1567760683.csv
./data/1-raw-tweets//twitter_data_1567760683.csv
Loading twitter_data_1567761032.csv
./data/1-raw-tweets//twitter_data_1567761032.csv
     Author Name Company Name Crypto Favorite Count           Mes

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




The file has been saved as *cleaned_tweets.csv* in **./data/2-cleaned-tweets/**
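
Note that `drop_duplicates` in the loading loop runs on each file separately, so a tweet that appears in two different files survives the concatenation. If that matters, a second pass after `pd.concat` could remove such cross-file duplicates. The sketch below shows the idea on two tiny made-up dataframes; the data and variable names are assumptions, only the `Processed Text` column name comes from the notebook.

```python
import pandas as pd

# two made-up "files" that share one tweet
file_a = pd.DataFrame({'Processed Text': ['bitcoin is up', 'bitcoin is up', 'eth is down']})
file_b = pd.DataFrame({'Processed Text': ['eth is down', 'ltc is flat']})

# per-file de-duplication, as in the loading loop
parts = [d.drop_duplicates(subset='Processed Text') for d in (file_a, file_b)]

# concatenation keeps the tweet shared by both files twice
combined = pd.concat(parts, ignore_index=True)
print(len(combined))  # 4

# an extra pass after concatenation removes cross-file duplicates
deduplicated = combined.drop_duplicates(subset='Processed Text').reset_index(drop=True)
print(len(deduplicated))  # 3
```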

# Next step: Part 3 - Labelling data
First, label the tweets in order to generate the training data:

- File *labelling_data.ipynb* in folder **3-Filter-Data**

Second, train, test and run predictions with the spam-or-ham classifier:

- File *filter_tweets.ipynb* in folder **3-Filter-Data**