##### Problem Statement
 
Many companies are built around lessening one's environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories, thus increasing their insights and informing future marketing strategies.

Import the necessary packages

In [29]:
import re
import pickle

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.datasets import load_files
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# Fetch the stopword list used during preprocessing
nltk.download('stopwords')

# Shared lemmatizer and stemmer instances
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WFS90008514\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Read in the data and make a copy of the test set, so that the original is preserved even if we change the working copy later.

In [30]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test_original=test.copy()

In [31]:
test.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [32]:
train.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


As we can see, the test set has two attributes, 'message' and 'tweetid', while the train set has three: 'sentiment', 'message' and 'tweetid'.

In [33]:
train.sentiment.value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64
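
The counts above are far from balanced, which matters when reading accuracy figures later. A minimal sketch, using only the counts printed above, of each class's share of the training set; the names `counts` and `shares` are introduced here for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 8530, 2: 3640, 0: 2353, -1: 1296}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"sentiment {label:>2}: {share:.1%}")
```

Class 1 makes up a little over half of the training tweets, so a model that always predicts class 1 would already be right about 54% of the time on this data.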

In [34]:
# Combine train and test so both are preprocessed in one pass
df=pd.concat([train, test])

In [35]:
df.head()

Unnamed: 0,sentiment,message,tweetid
0,1.0,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1.0,It's not like we lack evidence of anthropogeni...,126103
2,2.0,RT @RawStory: Researchers say we have three ye...,698562
3,1.0,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1.0,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [36]:
y = train['sentiment']
X = df['message']

#### Text Preprocessing

In [37]:
def preprocess(sentence):
    # Lowercase the text, then strip HTML tags, URLs and digits
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = re.sub(r'<.*?>', '', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    # Split into word tokens and drop short words and stopwords
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    stop_words = set(stopwords.words('english'))  # build the set once, not once per token
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    # The stemmed and lemmatized tokens are computed but not returned,
    # so cleanText holds the filtered, unstemmed words
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)


df['cleanText']=df['message'].map(lambda s:preprocess(s)) 
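
To make the individual cleaning steps easier to follow, here is a standard-library sketch of the same sequence (lowercase, strip tags, URLs and digits, tokenize, filter). It is an illustration, not the notebook's function: `clean_tweet` and the tiny `STOP_WORDS` set are hypothetical stand-ins for `preprocess` and NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "and", "that", "this", "with", "not", "are", "was", "for", "have"}

def clean_tweet(text):
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)      # strip HTML-like tags
    text = re.sub(r'http\S+', '', text)    # strip URLs
    text = re.sub(r'[0-9]+', '', text)     # strip digits
    tokens = re.findall(r'\w+', text)      # split into word tokens
    kept = [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]
    return " ".join(kept)

print(clean_tweet("RT @RawStory: Researchers say we have 3 years https://t.co/abc"))
print(clean_tweet("RT @xanria_00018: hot"))
```

Because `\w` also matches underscores, a handle such as `@xanria_00018` is reduced to `xanria_` once its digits are removed, which is the same effect visible in the cleanText column.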

In [38]:
df

Unnamed: 0,sentiment,message,tweetid,cleanText
0,1.0,PolySciMajor EPA chief doesn't think carbon di...,625221,polyscimajor epa chief think carbon dioxide ma...
1,1.0,It's not like we lack evidence of anthropogeni...,126103,like lack evidence anthropogenic global warming
2,2.0,RT @RawStory: Researchers say we have three ye...,698562,rawstory researchers say three years act clima...
3,1.0,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736,todayinmaker wired pivotal year war climate ch...
4,1.0,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954,soynoviodetodas racist sexist climate change d...
...,...,...,...,...
10541,,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,brittanybohrer brb writing poem climate change...
10542,,2016: the year climate change came home: Durin...,875167,year climate change came home hottest year rec...
10543,,RT @loop_vanuatu: Pacific countries positive a...,78329,loop_vanuatu pacific countries positive fiji l...
10544,,"RT @xanria_00018: You’re so hot, you must be t...",867455,xanria_ hot must cause global warming aldublab...


In [39]:
train = df[pd.notnull(df['sentiment'])]
 

In [40]:
train

Unnamed: 0,sentiment,message,tweetid,cleanText
0,1.0,PolySciMajor EPA chief doesn't think carbon di...,625221,polyscimajor epa chief think carbon dioxide ma...
1,1.0,It's not like we lack evidence of anthropogeni...,126103,like lack evidence anthropogenic global warming
2,2.0,RT @RawStory: Researchers say we have three ye...,698562,rawstory researchers say three years act clima...
3,1.0,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736,todayinmaker wired pivotal year war climate ch...
4,1.0,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954,soynoviodetodas racist sexist climate change d...
...,...,...,...,...
15814,1.0,RT @ezlusztig: They took down the material on ...,22001,ezlusztig took material global warming lgbt ri...
15815,2.0,RT @washingtonpost: How climate change could b...,17856,washingtonpost climate change could breaking m...
15816,0.0,notiven: RT: nytimesworld :What does Trump act...,384248,notiven nytimesworld trump actually believe cl...
15817,-1.0,RT @sara8smiles: Hey liberals the climate chan...,819732,sarasmiles hey liberals climate change crap ho...


In [41]:
test = df[pd.isnull(df['sentiment'])]

In [42]:
test

Unnamed: 0,sentiment,message,tweetid,cleanText
0,,Europe will now be looking to China to make su...,169760,europe looking china make sure alone fighting ...
1,,Combine this with the polling of staffers re c...,35326,combine polling staffers climate change womens...
2,,"The scary, unimpeachable evidence that climate...",224985,scary unimpeachable evidence climate change al...
3,,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,karoli morgfair osborneink dailykos putin got ...
4,,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,fakewillmoore female orgasms cause global warm...
...,...,...,...,...
10541,,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,brittanybohrer brb writing poem climate change...
10542,,2016: the year climate change came home: Durin...,875167,year climate change came home hottest year rec...
10543,,RT @loop_vanuatu: Pacific countries positive a...,78329,loop_vanuatu pacific countries positive fiji l...
10544,,"RT @xanria_00018: You’re so hot, you must be t...",867455,xanria_ hot must cause global warming aldublab...


In [43]:
# Drop the label column from the unlabeled rows only; calling df.drop here would bring the training rows back
test = test.drop(['sentiment'],axis=1)

In [44]:
test

Unnamed: 0,message,tweetid,cleanText
0,Europe will now be looking to China to make su...,169760,europe looking china make sure alone fighting ...
1,Combine this with the polling of staffers re c...,35326,combine polling staffers climate change womens...
2,"The scary, unimpeachable evidence that climate...",224985,scary unimpeachable evidence climate change al...
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,karoli morgfair osborneink dailykos putin got ...
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,fakewillmoore female orgasms cause global warm...
...,...,...,...
10541,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,brittanybohrer brb writing poem climate change...
10542,2016: the year climate change came home: Durin...,875167,year climate change came home hottest year rec...
10543,RT @loop_vanuatu: Pacific countries positive a...,78329,loop_vanuatu pacific countries positive fiji l...
10544,"RT @xanria_00018: You’re so hot, you must be t...",867455,xanria_ hot must cause global warming aldublab...


#### Converting Text to Numbers

Bag of Words

In [45]:
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(train['cleanText']).toarray()

In [46]:
# Reuse the vocabulary learned on the training text so the test columns match the training columns
Xtest = vectorizer.transform(test['cleanText']).toarray()
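
For the columns of the training and test matrices to refer to the same words, the vocabulary is normally learned once on the training text and then reused. The toy sketch below shows that idea in plain Python; it is not scikit-learn's implementation, and `build_vocabulary` and `to_counts` are hypothetical helpers.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    # Count total occurrences of each token and keep the most frequent ones.
    totals = Counter(token for doc in docs for token in doc.split())
    most_common = [tok for tok, _ in totals.most_common(max_features)]
    return {tok: i for i, tok in enumerate(sorted(most_common))}

def to_counts(docs, vocabulary):
    # One row per document, one column per vocabulary entry; unknown tokens are ignored.
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_docs = ["climate change real", "global warming hoax", "climate action now"]
test_docs = ["climate hoax debate"]

vocab = build_vocabulary(train_docs)      # learned from training text only
X_toy = to_counts(train_docs, vocab)
Xtest_toy = to_counts(test_docs, vocab)   # same columns; "debate" is unseen and dropped
```

Both matrices have one column per training-vocabulary word, so a model fitted on `X_toy` can score `Xtest_toy` directly.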

#### Finding TFIDF

In [47]:
#Term frequency = (Number of Occurrences of a word)/(Total words in the document)
#IDF(word) = Log((Total number of documents)/(Number of documents containing the word))
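
The two formulas in the comments above can be worked through by hand on a toy corpus. This is a sketch of the textbook definitions only; the function names and the three toy documents are made up for illustration. Note that scikit-learn's `TfidfTransformer` differs by default: it uses raw counts for the term frequency, a smoothed IDF, and L2-normalizes each row, so its numbers will not match these.

```python
import math

docs = [
    ["climate", "change", "real"],
    ["climate", "hoax"],
    ["global", "warming", "real", "real"],
]

def term_frequency(word, doc):
    # Occurrences of the word divided by the number of words in the document.
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # Assumes the word appears in at least one document.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

score_real = tf_idf("real", docs[2], docs)   # (2/4) * log(3/2)
score_hoax = tf_idf("hoax", docs[1], docs)   # (1/2) * log(3/1)
print(score_real, score_hoax)
```

Both words have the same term frequency in their documents, but "hoax" appears in only one document, so it receives the higher weight.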

In [48]:
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

In [49]:
# Apply the IDF weights learned on the training data rather than refitting on the test data
Xtest = tfidfconverter.transform(Xtest).toarray()

Splitting the labelled data into a training set and a validation set

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
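
The split above is purely random. With classes as uneven as these, one option (not used in this notebook) is a stratified split, which keeps each class's share the same in both parts; scikit-learn's `train_test_split` offers this through its `stratify` parameter. The plain-Python sketch below shows the idea on made-up labels; `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    # Group row indices by label, then hold out the same fraction from each group.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Made-up labels with an imbalance loosely resembling the sentiment column.
toy_labels = [1] * 50 + [2] * 30 + [0] * 15 + [-1] * 5
train_idx, val_idx = stratified_split(toy_labels)
```

Every class, including the rare one, ends up represented in the validation part in proportion to its size.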

In [67]:
classifier = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10)
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=10, verbose=0,
                       warm_start=False)

In [68]:
y_pred = classifier.predict(X_test)

In [69]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

[[   9    0  270    0]
 [   0   13  447    4]
 [   0    0 1688    3]
 [   0    0  681   49]]
              precision    recall  f1-score   support

          -1       1.00      0.03      0.06       279
           0       1.00      0.03      0.05       464
           1       0.55      1.00      0.71      1691
           2       0.88      0.07      0.12       730

    accuracy                           0.56      3164
   macro avg       0.86      0.28      0.24      3164
weighted avg       0.73      0.56      0.42      3164

0.5559418457648546
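
The figures in the report can be recomputed directly from the confusion matrix, which helps explain why accuracy looks acceptable while the macro average does not. A minimal sketch using only the matrix printed above; the variable names are introduced here for illustration.

```python
labels = [-1, 0, 1, 2]
# Rows are true labels, columns are predicted labels (scikit-learn's convention).
cm = [
    [9, 0, 270, 0],
    [0, 13, 447, 4],
    [0, 0, 1688, 3],
    [0, 0, 681, 49],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

f1_scores = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)   # column sum: times this label was predicted
    actual = sum(cm[i])                     # row sum: times this label was the truth
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    f1_scores[label] = f1
    print(f"{label:>2}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.2f}")
```

The third column sums to 3086, meaning class 1 was predicted for 3086 of the 3164 validation tweets. Accuracy therefore sits close to the majority-class share, while recall for the other three classes stays below 0.1 and pulls the macro F1 down to about 0.24.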


In [71]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

In [72]:
from sklearn.metrics import f1_score
f1_score(y_test, rfc_pred, average="macro")

0.583216351386751

The cells above train the model and evaluate it on the validation set.

Getting our test set ready

In [19]:
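# Apply the same cleaning, bag-of-words and TF-IDF steps that were used for the training features
testx = test['message'].map(preprocess)
test_vect = tfidfconverter.transform(vectorizer.transform(testx))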

Making predictions on the test set and adding a sentiment column to our original test df

In [20]:
y_pred = rfc.predict(test_vect)

In [21]:
test['sentiment'] = y_pred

In [22]:
test.head()

Unnamed: 0,message,tweetid,sentiment
0,Europe will now be looking to China to make su...,169760,1
1,Combine this with the polling of staffers re c...,35326,1
2,"The scary, unimpeachable evidence that climate...",224985,1
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,1
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,0


Creating an output csv for submission

In [23]:
test[['tweetid','sentiment']].to_csv('testsubmission.csv', index=False)