# Part 2 - Tweets cleaning
## Comments
In this section we clean the tweets: we read them as UTF-8, remove every word containing `#` or `@`, and strip links along with a few uninformative words and acronyms (`RT`, `Form`, `Inc`, `App`, `Alerts`).
We then store the processed text of the tweets in a dataframe, which is saved as a CSV file:
- **twitter_data_clean.csv** : contained in */7-Data/2-CleanTweets/*

## Libraries

In [5]:
# Our main libraries
import numpy as np
import pandas as pd
import sys
import re, string
from os import listdir
from os.path import isfile, join

## Main functions

In [6]:
# Strip links, entities (@mentions, #hashtags) and filler words
def strip_tweet(text):

    # replace every link with a space (raw string keeps the regex escapes intact)
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', text)

    # entity prefixes
    entity_prefixes = ['@', '#']

    # drop every word that contains an entity prefix; splitting on whitespace
    # also merges multi-line tweets into a single line
    words = [word for word in text.split()
             if not any(prefix in word for prefix in entity_prefixes)]
    text = ' '.join(words)

    # Remove various unimportant words, matched as whole words only so that
    # e.g. 'Apple' does not lose its 'App':
    # RT (retweet marker), Form, Inc (we know we are talking about corporations),
    # App and Alerts (not considered important)
    text = re.sub(r'\b(?:RT|Form|Inc|App|Alerts)\b', '', text)

    # return the processed text
    return text
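
When deleting filler tokens such as `RT` or `App`, it matters whether the match is a substring or a whole word. The following is a minimal sketch of the difference; the helper names, the `FILLER_WORDS` list and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical list of filler tokens, mirroring the ones removed in strip_tweet
FILLER_WORDS = ['RT', 'Form', 'Inc', 'App', 'Alerts']

def remove_substrings(text, words=FILLER_WORDS):
    # Plain str.replace deletes the characters wherever they occur,
    # including inside longer words
    for w in words:
        text = text.replace(w, '')
    return text

def remove_whole_words(text, words=FILLER_WORDS):
    # \b anchors each alternative to word boundaries, so 'Apple' keeps its 'App'
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    text = re.sub(pattern, ' ', text)
    # collapse the runs of whitespace left behind by the deletions
    return ' '.join(text.split())

sample = 'RT Apple Inc launches new App'
print(repr(remove_substrings(sample)))
print(repr(remove_whole_words(sample)))
```

With substring deletion the company name `Apple` is damaged, whereas the word-boundary version leaves it intact, which is usually what is wanted when the tweets are about named companies.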

## We load our companies dataset

In [7]:
# load companies
companies = pd.read_csv('./data/crypto.csv',index_col=0, encoding="utf-8")

companies.index = companies['Name']

# discover tweet files
datapath = '../7-Data/1-RawTweets'
files = [f for f in listdir(datapath) if isfile(join(datapath, f))]

# load tweets
dfs = []

# We go through all the tweets data files in our folder.
for f in files:
    
    # Shows which file is currently processed
    print('Loading {}'.format(f))
    
    # Full filepath
    full_filepath = '{0}/{1}'.format(datapath, f)
    
    # We try to read the files
    try:
        df = pd.read_csv(full_filepath, encoding="utf-8", index_col=0, engine='python')
        
        print(full_filepath)
        
        df['Processed Text'] = df['Text'].apply(strip_tweet)
        
        # We drop duplicate tweets (rows with the same processed text)
        df = df.drop_duplicates(subset='Processed Text')
        
        dfs.append(df)
    except Exception:
        # Skip files that cannot be read or processed
        print('Failed to load %s' % f)

# We concatenate all tweets into one dataframe
df = pd.concat(dfs)

# We check.
print(df.head())

# Shows that it worked.
print('Files Loaded.')

# We save the combined dataframe of clean tweets to a csv file
df.to_csv('../7-Data/2-CleanTweets/twitter_data_clean.csv', header=True, encoding="utf-8")

print("File Saved.")

print("Done.")

Loading twitter_data.csv
../7-Data/1-RawTweets/twitter_data.csv
Failed to load twitter_data.csv
Loading twitter_data_1527428653.csv
../7-Data/1-RawTweets/twitter_data_1527428653.csv
Loading twitter_data_1527456526.csv
../7-Data/1-RawTweets/twitter_data_1527456526.csv
Loading twitter_data_1527693686.csv
../7-Data/1-RawTweets/twitter_data_1527693686.csv
Loading twitter_data_1527693856.csv
../7-Data/1-RawTweets/twitter_data_1527693856.csv
Loading twitter_data_1527856125.csv
../7-Data/1-RawTweets/twitter_data_1527856125.csv
Loading twitter_data_1527873917.csv
../7-Data/1-RawTweets/twitter_data_1527873917.csv
Loading twitter_data_1527949217.csv
../7-Data/1-RawTweets/twitter_data_1527949217.csv
Loading twitter_data_1527952342.csv
../7-Data/1-RawTweets/twitter_data_1527952342.csv
Loading twitter_data_1528037772.csv
../7-Data/1-RawTweets/twitter_data_1528037772.csv
             Company Name            Author Name  \
0  3D Systems Corporation  takai backpack N k.L.   
1  3D Systems Corporation 

The file has been saved as *twitter_data_clean.csv* in **/7-Data/2-CleanTweets**
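
Note that the loading loop drops duplicates within each file only, so a tweet captured in two different raw files would still appear twice after concatenation. The following is a minimal sketch of a second deduplication pass after `pd.concat`, assuming the `Processed Text` column used above; the two small frames are made-up data standing in for two raw files.

```python
import pandas as pd

# Two made-up frames standing in for two raw tweet files (illustrative data)
df_a = pd.DataFrame({'Processed Text': ['bitcoin up', 'bitcoin up', 'eth down']})
df_b = pd.DataFrame({'Processed Text': ['eth down', 'ltc flat']})

# Per-file deduplication, as in the loading loop
frames = [d.drop_duplicates(subset='Processed Text') for d in (df_a, df_b)]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)

# A second pass removes duplicates that span several files
deduped = combined.drop_duplicates(subset='Processed Text').reset_index(drop=True)

print(len(combined), len(deduped))
```

Whether cross-file duplicates should be removed depends on how the later labelling step treats repeated tweets, so this is a design choice and not a required fix.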

# Now the user may go to Part 3 - Labelling data
First, label the tweets in order to generate our training data:

- File *labelling_data.ipynb* in folder **3-Filter-Data**

Second, train, test and run predictions with the spam-or-ham classifier:
 
- File *filter_tweets.ipynb* in folder **3-Filter-Data**