## Part 2: Data Cleaning


**Common data cleaning steps for all texts:**

* Convert the text to lower or upper case
* Remove punctuation
* Remove numeric values
* Remove symbols
* Remove stop words

**Data cleaning steps after tokenization:**

* Tokenize the text
* Stemming
* Lemmatization
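
The common steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own implementation: `basic_clean` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself relies on NLTK's English stop-word list), and tokenization here is a plain whitespace split.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# a real pipeline would use a full list such as NLTK's.
STOP_WORDS = {"the", "a", "an", "of", "in", "for", "and", "to"}


def basic_clean(text):
    """Apply the common cleaning steps and return a list of tokens."""
    # Convert to lower case
    text = text.lower()
    # Replace numeric values with a space
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace any remaining symbol (anything but a-z and whitespace) with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize on whitespace
    tokens = text.split()
    # Remove stop words
    return [token for token in tokens if token not in STOP_WORDS]


print(basic_clean("Bangladesh were bowled out for 184 in 58 overs!"))
```

Stemming and lemmatization are left out of this sketch because they need a dedicated library.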



### 1. Import Dataset

In [40]:
import pandas as pd
## for text processing
import re
import nltk
df=pd.read_csv("data.csv")
df.head()

Unnamed: 0,Label,Text
0,cricket,"b""Kumble breaks Kapil's record\n\nFirst Test, ..."
1,cricket,"b""Aussies tighten grip\n\nFirst Test, Perth, d..."
2,cricket,b'Vaughan ready for South Africa\n\nSkipper Mi...
3,cricket,b'World XI triumph in tsunami match\n\nTsunami...
4,cricket,b'Shoaib ruled out of Test series\n\nFast bowl...


In [41]:
# Download the NLTK stop-word corpus and load the English list
from nltk.corpus import stopwords
nltk.download('stopwords')
lst_stopwords = stopwords.words('english')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/davidjeannette/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
df.Text[0]

'b"Kumble breaks Kapil\'s record\\n\\nFirst Test, Dhaka, day one (stumps): Bangladesh 184 all out v India\\n\\nKumble overtook the mark set by Kapil Dev when he had Mohammad Rafique lbw. And he followed up with a wicket next ball before Bangladesh were bowled out for 184 in 58 overs in Dhaka. After the first session was lost to rain, Irfan Pathan took five wickets to reduce the hosts to 106-7 before Mohammad Ashraful dug in. Ashraful ended unbeaten on 60, having hit six fours and faced 135 balls. Kumble had a chance of a hat-trick after removing Tapash Baisya via a catch at first slip but Mashrafe Mortaza safely defended the fifth ball of his 12th over. But a run out ended the innings not long afterwards.\\n\\nIndia did not get chance to begin their reply as openers Virender Sehwag and Gautam Gambhir were immediately offered the light on stepping to the wicket. India won the toss and Pathan soon got stuck into the top order, dismissing Javed Omar lbw in his second over with one that ni
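
The value above shows that each `Text` entry is the string form of a Python bytes literal (`b"..."` with escaped `\n` sequences). One possible way to recover the original text, shown here only as a sketch, is to parse the literal with `ast.literal_eval` and decode it. The helper name `decode_bytes_literal` is hypothetical, and the UTF-8 encoding is an assumption about the data.

```python
import ast


def decode_bytes_literal(value):
    """Turn the text "b'...'" back into a normal string.

    Assumes the value is the repr of a UTF-8 encoded bytes object;
    anything that does not parse as a bytes literal is returned unchanged.
    """
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value
    if isinstance(parsed, bytes):
        return parsed.decode("utf-8", errors="replace")
    return value


raw = 'b"Kumble breaks Kapil\'s record\\n\\nFirst Test, Dhaka"'
print(decode_bytes_literal(raw))
```

Decoding first would turn the escaped `\n` sequences into real newlines and remove the `b` prefix and the surrounding quotes before any regex-based cleaning runs.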

In [43]:
import re
import string


def clean_text(text):

    # Strip the bytes-literal prefix (b" or b') at the start of the raw text
    text = re.sub(r'^b["\']', '', text)

    # Replace escaped newlines (a literal backslash followed by n) with a space
    text = text.replace('\\n', ' ')

    # Replace apostrophes with a space, so "Kapil's" becomes "Kapil s"
    text = text.replace("'", " ")

    # Remove bracketed content such as [1]
    text = re.sub(r'\[.*?\]', '', text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Replace any remaining special characters with a space
    text = re.sub(r'\W', ' ', text)

    # Remove single characters surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)

    # Remove a single character at the start of the text
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)

    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    # Convert to lower case
    text = text.lower()

    return text


In [44]:
df["documents"] = df["Text"].map(clean_text)
df.head()

Unnamed: 0,Label,Text,documents
0,cricket,"b""Kumble breaks Kapil's record\n\nFirst Test, ...",kumble breaks kapil record first test dhaka da...
1,cricket,"b""Aussies tighten grip\n\nFirst Test, Perth, d...",aussies tighten grip first test perth day thre...
2,cricket,b'Vaughan ready for South Africa\n\nSkipper Mi...,vaughan ready for south africa skipper michael...
3,cricket,b'World XI triumph in tsunami match\n\nTsunami...,world xi triumph in tsunami match tsunami appe...
4,cricket,b'Shoaib ruled out of Test series\n\nFast bowl...,shoaib ruled out of test series fast bowler sh...


In [45]:
df.documents[0]

'kumble breaks kapil record first test dhaka day one stumps bangladesh all out india kumble overtook the mark set by kapil dev when he had mohammad rafique lbw and he followed up with wicket next ball before bangladesh were bowled out for in overs in dhaka after the first session was lost to rain irfan pathan took five wickets to reduce the hosts to before mohammad ashraful dug in ashraful ended unbeaten on having hit six fours and faced balls kumble had chance of hattrick after removing tapash baisya via catch at first slip but mashrafe mortaza safely defended the fifth ball of his th over but run out ended the innings not long afterwards india did not get chance to begin their reply as openers virender sehwag and gautam gambhir were immediately offered the light on stepping to the wicket india won the toss and pathan soon got stuck into the top order dismissing javed omar lbw in his second over with one that nipped back nafis iqbal and rajin saleh were also ajudged lbw by umpire jere

In [46]:
# Stemming and tokenization
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob


ps = PorterStemmer()

# Stem every word of the cleaned text
df["stemming"] = df["documents"].map(lambda x: " ".join(ps.stem(mot) for mot in x.split()))
# Tokens of the stemmed text
df["tokens_all"] = df["stemming"].map(lambda x: TextBlob(x).words)
# Tokens of the cleaned, unstemmed text
df["tokens"] = df["documents"].map(lambda x: TextBlob(x).words)

df.head()

Unnamed: 0,Label,Text,documents,stemming,tokens_all,tokens
0,cricket,"b""Kumble breaks Kapil's record\n\nFirst Test, ...",kumble breaks kapil record first test dhaka da...,kumbl break kapil record first test dhaka day ...,"[kumbl, break, kapil, record, first, test, dha...","[kumble, breaks, kapil, record, first, test, d..."
1,cricket,"b""Aussies tighten grip\n\nFirst Test, Perth, d...",aussies tighten grip first test perth day thre...,aussi tighten grip first test perth day three ...,"[aussi, tighten, grip, first, test, perth, day...","[aussies, tighten, grip, first, test, perth, d..."
2,cricket,b'Vaughan ready for South Africa\n\nSkipper Mi...,vaughan ready for south africa skipper michael...,vaughan readi for south africa skipper michael...,"[vaughan, readi, for, south, africa, skipper, ...","[vaughan, ready, for, south, africa, skipper, ..."
3,cricket,b'World XI triumph in tsunami match\n\nTsunami...,world xi triumph in tsunami match tsunami appe...,world xi triumph in tsunami match tsunami appe...,"[world, xi, triumph, in, tsunami, match, tsuna...","[world, xi, triumph, in, tsunami, match, tsuna..."
4,cricket,b'Shoaib ruled out of Test series\n\nFast bowl...,shoaib ruled out of test series fast bowler sh...,shoaib rule out of test seri fast bowler shoai...,"[shoaib, rule, out, of, test, seri, fast, bowl...","[shoaib, ruled, out, of, test, series, fast, b..."


In [50]:
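def remove_stopwords(tokens):

    # Use a set so that each membership test is fast
    all_stopwords = set(stopwords.words('english'))

    # Keep only the tokens that are not English stop words
    tokens_without_sw = [word for word in tokens if word not in all_stopwords]

    return tokens_without_sw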
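
As a variation on the function above, the stop-word collection can be built once, outside the function, and passed in as a set. The sketch below is self-contained: `ENGLISH_STOP_WORDS` is a hypothetical, much shorter stand-in for `stopwords.words('english')`, and `remove_stopwords_fast` is an illustrative name, not part of the notebook.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list
# is much longer.
ENGLISH_STOP_WORDS = frozenset({"for", "in", "of", "out", "the", "to"})


def remove_stopwords_fast(tokens, stop_words=ENGLISH_STOP_WORDS):
    """Return the tokens that are not in stop_words, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["vaughan", "readi", "for", "south", "africa"]
print(remove_stopwords_fast(tokens))
```

Note that the filter in this notebook runs on stemmed tokens. A stop word whose stem differs from its original form (the Porter stemmer reduces "this" to "thi", for example) no longer matches the list, so filtering before stemming may remove more of them.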


In [51]:
df["final"] = df["tokens_all"].apply(remove_stopwords)

In [52]:
df.head()

Unnamed: 0,Label,Text,documents,stemming,tokens_all,tokens,final
0,cricket,"b""Kumble breaks Kapil's record\n\nFirst Test, ...",kumble breaks kapil record first test dhaka da...,kumbl break kapil record first test dhaka day ...,"[kumbl, break, kapil, record, first, test, dha...","[kumble, breaks, kapil, record, first, test, d...","[kumbl, break, kapil, record, first, test, dha..."
1,cricket,"b""Aussies tighten grip\n\nFirst Test, Perth, d...",aussies tighten grip first test perth day thre...,aussi tighten grip first test perth day three ...,"[aussi, tighten, grip, first, test, perth, day...","[aussies, tighten, grip, first, test, perth, d...","[aussi, tighten, grip, first, test, perth, day..."
2,cricket,b'Vaughan ready for South Africa\n\nSkipper Mi...,vaughan ready for south africa skipper michael...,vaughan readi for south africa skipper michael...,"[vaughan, readi, for, south, africa, skipper, ...","[vaughan, ready, for, south, africa, skipper, ...","[vaughan, readi, south, africa, skipper, micha..."
3,cricket,b'World XI triumph in tsunami match\n\nTsunami...,world xi triumph in tsunami match tsunami appe...,world xi triumph in tsunami match tsunami appe...,"[world, xi, triumph, in, tsunami, match, tsuna...","[world, xi, triumph, in, tsunami, match, tsuna...","[world, xi, triumph, tsunami, match, tsunami, ..."
4,cricket,b'Shoaib ruled out of Test series\n\nFast bowl...,shoaib ruled out of test series fast bowler sh...,shoaib rule out of test seri fast bowler shoai...,"[shoaib, rule, out, of, test, seri, fast, bowl...","[shoaib, ruled, out, of, test, series, fast, b...","[shoaib, rule, test, seri, fast, bowler, shoai..."


In [55]:
# Dataset for TF-IDF and Bag-of-Words feature engineering (keeps the 'final' column)
df_cleaning = df.drop(['Text','documents','stemming','tokens_all','tokens'],axis=1)

# Datasets for Latent Dirichlet Allocation (unsupervised learning)
# df_tokens keeps the unstemmed 'tokens' column, df_doc keeps the stemmed 'tokens_all' column
df_tokens = df.drop(['Text','documents','stemming','tokens_all','final'],axis=1)
df_doc = df.drop(['Text','tokens','stemming','documents','final'],axis=1)

import os
path = os.getcwd()

# Write each dataset to the current working directory
df_cleaning.to_csv(os.path.join(path, 'data_cleaning.csv'), index = False)
df_tokens.to_csv(os.path.join(path, 'data_tokens.csv'), index = False)
df_doc.to_csv(os.path.join(path, 'data_doc.csv'), index = False)
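
The token columns saved above hold Python lists, and a CSV file can only store them as text, so they come back as strings when the file is reloaded. Below is a minimal sketch of that round trip using only the standard library. The notebook itself uses pandas; after `pd.read_csv`, the same `ast.literal_eval` call could be applied to the column, assuming each cell was written as a plain list literal.

```python
import ast
import csv
import io

tokens = ["kumbl", "break", "kapil", "record"]

# Write one row whose second field is the string form of the token list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Label", "final"])
writer.writerow(["cricket", str(tokens)])

# Read it back: the field is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["final"]).__name__)

# Parse the string back into a list of tokens.
restored = ast.literal_eval(row["final"])
print(restored)
```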