# Team 4: Climate Change Belief Analysis

## Contents 

#### 1. Introduction
#### 2. Import libraries 
#### 3. Import datasets
#### 4. Data cleaning and preprocessing
#### 5. Exploratory data analysis
#### 6. Feature engineering and selection
#### 7. Model building
#### 8. Results interpretation
#### 9. Conclusion

## 1. Introduction

### Problem landscape

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

### Aim

The aim of this project is to find out if people believe in climate change or not . And this will be done by view people's previous sentiments when it comes to Climate Change. This will give an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic catergories. Thus increasing their insights and informing future marketing strategies.

### Problem statement 

 Building a Machine learning model that is able to classify whether or not a person believes in climate change , based on their novel tweet data

## 2. Importing libraries 

In [2]:
#Deploying the model
from comet_ml import Experiment

In [3]:
# Api key
experiment = Experiment(api_key="NMdrE2Fvv00bzfhwE99pCjGSq",
                        project_name="team-4-climate-change", workspace="primmk")

COMET INFO: old comet version (3.1.10) detected. current: 3.1.11 please update your comet lib with command: `pip install --no-cache-dir --upgrade comet_ml`
COMET INFO: Experiment is live on comet.ml https://www.comet.ml/primmk/team-4-climate-change/3efc36f0795645a885738c0ea138c5db



In [2]:
#Standard Imports
import numpy as np
import pandas as pd
import re

#Visualisations 
import matplotlib.pyplot as plt
import seaborn as sns 

#DATA CLEANING
from nltk.stem import PorterStemmer
import nltk
from nltk.corpus import stopwords
from textblob import Word

#Deploying the model
from comet_ml import Experiment

#MODELLING
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics 
from sklearn.svm import LinearSVC
from sklearn.svm import SVC

ModuleNotFoundError: No module named 'textblob'

## 3. Importing datasets

In [None]:
df_train=pd.read_csv(r"C:\Users\ramuk\Downloads\climate-change-belief-analysis\train.csv")
df_test =pd.read_csv(r"C:\Users\ramuk\Downloads\climate-change-belief-analysis\test.csv")

In [6]:
df_train.head()
print('Dataset size:', df_train.shape)

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [7]:
df_test.head()
print('Dataset size:', df_test.shape)

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [None]:
#Checking data info on datasets
df_train.info()

In [None]:
df_test()

## 3. Data cleaning and preprocessing

### 3.1) Transforming text to lowercase

In [8]:
df_train['message'] = df_train['message'].apply(lambda x: " ".join(x.lower()
for x in x.split()))
df_train['message']

0        polyscimajor epa chief doesn't think carbon di...
1        it's not like we lack evidence of anthropogeni...
2        rt @rawstory: researchers say we have three ye...
3        #todayinmaker# wired : 2016 was a pivotal year...
4        rt @soynoviodetodas: it's 2016, and a racist, ...
                               ...                        
15814    rt @ezlusztig: they took down the material on ...
15815    rt @washingtonpost: how climate change could b...
15816    notiven: rt: nytimesworld :what does trump act...
15817    rt @sara8smiles: hey liberals the climate chan...
15818    rt @chet_cannon: .@kurteichenwald's 'climate c...
Name: message, Length: 15819, dtype: object

In [9]:
df_test['message'] = df_test['message'].apply(lambda x: " ".join(x.lower()
for x in x.split()))
df_test['message']

0        europe will now be looking to china to make su...
1        combine this with the polling of staffers re c...
2        the scary, unimpeachable evidence that climate...
3        @karoli @morgfair @osborneink @dailykos putin ...
4        rt @fakewillmoore: 'female orgasms cause globa...
                               ...                        
10541    rt @brittanybohrer: brb, writing a poem about ...
10542    2016: the year climate change came home: durin...
10543    rt @loop_vanuatu: pacific countries positive a...
10544    rt @xanria_00018: you’re so hot, you must be t...
10545    rt @chloebalaoing: climate change is a global ...
Name: message, Length: 10546, dtype: object

### 3.2) Removing the punctuation

In [10]:
df_train['message'] = df_train['message'].str.replace('[^\w\s]','')
df_train['message']

0        polyscimajor epa chief doesnt think carbon dio...
1        its not like we lack evidence of anthropogenic...
2        rt rawstory researchers say we have three year...
3        todayinmaker wired  2016 was a pivotal year in...
4        rt soynoviodetodas its 2016 and a racist sexis...
                               ...                        
15814    rt ezlusztig they took down the material on gl...
15815    rt washingtonpost how climate change could be ...
15816    notiven rt nytimesworld what does trump actual...
15817    rt sara8smiles hey liberals the climate change...
15818    rt chet_cannon kurteichenwalds climate change ...
Name: message, Length: 15819, dtype: object

In [11]:
df_test['message'] = df_train['message'].str.replace('[^\w\s]','')
df_test['message']

0        polyscimajor epa chief doesnt think carbon dio...
1        its not like we lack evidence of anthropogenic...
2        rt rawstory researchers say we have three year...
3        todayinmaker wired  2016 was a pivotal year in...
4        rt soynoviodetodas its 2016 and a racist sexis...
                               ...                        
10541    ecowas says addressing climate change issues w...
10542    sarahartman my god hes going to build a wall p...
10543    rt sensanders if we dont address climate chang...
10544    im wearing a jean jacket during winter global ...
10545    rt awuillermin you know what im so stoked on t...
Name: message, Length: 10546, dtype: object

### 3.3) Removing stopwords

In [12]:
#remove stop words
stop = stopwords.words('english')
df_train['message'] = df_train['message'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df_train['message']

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researchers say three years act cl...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
15814    rt ezlusztig took material global warming lgbt...
15815    rt washingtonpost climate change could breakin...
15816    notiven rt nytimesworld trump actually believe...
15817    rt sara8smiles hey liberals climate change cra...
15818    rt chet_cannon kurteichenwalds climate change ...
Name: message, Length: 15819, dtype: object

In [13]:
stop = stopwords.words('english')
df_test['message'] = df_train['message'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df_test['message']

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researchers say three years act cl...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
10541    ecowas says addressing climate change issues e...
10542    sarahartman god hes going build wall promote w...
10543    rt sensanders dont address climate change ther...
10544    im wearing jean jacket winter global warming b...
10545    rt awuillermin know im stoked fact people elec...
Name: message, Length: 10546, dtype: object

### 3.4) Stemming

In [15]:
st = PorterStemmer()
df_train['message'][:5].apply(lambda x: " ".join([st.stem(word) for
word in x.split()]))

0    polyscimajor epa chief doesnt think carbon dio...
1               like lack evid anthropogen global warm
2    rt rawstori research say three year act climat...
3    todayinmak wire 2016 pivot year war climat cha...
4    rt soynoviodetoda 2016 racist sexist climat ch...
Name: message, dtype: object

In [16]:
st = PorterStemmer()
df_test['message'][:5].apply(lambda x: " ".join([st.stem(word) for
word in x.split()]))

0    polyscimajor epa chief doesnt think carbon dio...
1               like lack evid anthropogen global warm
2    rt rawstori research say three year act climat...
3    todayinmak wire 2016 pivot year war climat cha...
4    rt soynoviodetoda 2016 racist sexist climat ch...
Name: message, dtype: object

### 3.5) Lemmetization

In [18]:
#Code for lemmatize
df_train['message'] = df_train['message'].apply(lambda x: " ".join([Word(word).
lemmatize() for word in x.split()]))
df_train['message']

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researcher say three year act clim...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
15814    rt ezlusztig took material global warming lgbt...
15815    rt washingtonpost climate change could breakin...
15816    notiven rt nytimesworld trump actually believe...
15817    rt sara8smiles hey liberal climate change crap...
15818    rt chet_cannon kurteichenwalds climate change ...
Name: message, Length: 15819, dtype: object

In [19]:
df_test['message'] = df_test['message'].apply(lambda x: " ".join([Word(word).
lemmatize() for word in x.split()]))
df_test['message']

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researcher say three year act clim...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
10541    ecowas say addressing climate change issue end...
10542    sarahartman god he going build wall promote wa...
10543    rt sensanders dont address climate change ther...
10544    im wearing jean jacket winter global warming b...
10545    rt awuillermin know im stoked fact people elec...
Name: message, Length: 10546, dtype: object

#### CHECKING FOR NULL VALUES

In [20]:
df_train.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

In [21]:
df_test.isnull().sum()

message    0
tweetid    0
dtype: int64

#### CHECKING FOR BLANKS

In [22]:
blanks = []  # start with an empty list

for i,sen,mes,twe in df_train.itertuples():  # iterate over the DataFrame
    if type(mes)==str:            # avoid NaN values
        if mes.isspace():         # test 'review' for whitespace
            blanks.append(i)     # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

0 blanks:  []


## 4. Exploratory Data Analysis

In [23]:
#Count values on Sentiment
df_train['sentiment'].value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64

In [24]:
df_train[:398].head(10)

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dio...,625221
1,1,like lack evidence anthropogenic global warming,126103
2,2,rt rawstory researcher say three year act clim...,698562
3,1,todayinmaker wired 2016 pivotal year war clima...,573736
4,1,rt soynoviodetodas 2016 racist sexist climate ...,466954
5,1,worth read whether dont believe climate change...,425577
6,1,rt thenation mike penny doesnt believe global ...,294933
7,1,rt makeandmendlife six big thing today fight c...,992717
8,1,aceofspadeshq 8yo nephew inconsolable want die...,664510
9,1,rt paigetweedy offense like believe global war...,260471


## 5. Feature engineering

### Converting text to features using One hot encoding

In [26]:

le = preprocessing.LabelEncoder()
Train= df_train.apply(le.fit_transform)

In [27]:
Train.head()

Unnamed: 0,sentiment,message,tweetid
0,2,4131,9853
1,2,3217,2014
2,3,10467,11061
3,2,13389,9055
4,2,11311,7367


In [28]:
le = preprocessing.LabelEncoder()
Test= df_test.apply(le.fit_transform)

In [29]:
Test.head()

Unnamed: 0,message,tweetid
0,2749,1806
1,2151,381
2,7005,2405
3,9028,5050
4,7597,9261


In [30]:
enc = preprocessing.OneHotEncoder()
enc.fit(Train)

# 3. Transform
onehotlabels = enc.transform(Train).toarray()
onehotlabels.shape

(15819, 30040)

In [31]:
onehotlabels

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [32]:
enc = preprocessing.OneHotEncoder()
enc.fit(Test)

# 3. Transform
onehotlabels = enc.transform(Test).toarray()
onehotlabels.shape

(10546, 20129)

In [33]:
onehotlabels

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Defining Variables

In [34]:
X=df_train['message']
y= df_train['sentiment']
unseen_data = df_test['message']

In [35]:
X

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researcher say three year act clim...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
15814    rt ezlusztig took material global warming lgbt...
15815    rt washingtonpost climate change could breakin...
15816    notiven rt nytimesworld trump actually believe...
15817    rt sara8smiles hey liberal climate change crap...
15818    rt chet_cannon kurteichenwalds climate change ...
Name: message, Length: 15819, dtype: object

In [36]:
y

0        1
1        1
2        2
3        1
4        1
        ..
15814    1
15815    2
15816    0
15817   -1
15818    0
Name: sentiment, Length: 15819, dtype: int64

In [37]:
unseen_data

0        polyscimajor epa chief doesnt think carbon dio...
1          like lack evidence anthropogenic global warming
2        rt rawstory researcher say three year act clim...
3        todayinmaker wired 2016 pivotal year war clima...
4        rt soynoviodetodas 2016 racist sexist climate ...
                               ...                        
10541    ecowas say addressing climate change issue end...
10542    sarahartman god he going build wall promote wa...
10543    rt sensanders dont address climate change ther...
10544    im wearing jean jacket winter global warming b...
10545    rt awuillermin know im stoked fact people elec...
Name: message, Length: 10546, dtype: object

### Splitting the dataset

In [38]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size =0.2, random_state=42)


# Building the model through a pipeline 

1.Logistic Regression 

In [39]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
text_clf = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', 
                             min_df=1, 
                             max_df=0.9, 
                             ngram_range=(1, 2))),
                     ('clf',LogisticRegression()),
])
# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.9, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_sc

In [40]:
predictions = text_clf.predict(X_test)

In [41]:
#from unseen data
y_pred = text_clf.predict(unseen_data)

In [42]:
from sklearn import metrics 
print(metrics.confusion_matrix(y_test,predictions))

[[  83   36  148   11]
 [   6  157  231   31]
 [   7   48 1593  107]
 [   3   17  201  485]]


In [43]:
#Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

          -1       0.84      0.30      0.44       278
           0       0.61      0.37      0.46       425
           1       0.73      0.91      0.81      1755
           2       0.76      0.69      0.72       706

    accuracy                           0.73      3164
   macro avg       0.74      0.57      0.61      3164
weighted avg       0.73      0.73      0.71      3164



In [44]:
#print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7326169405815424


Linear SVC

In [45]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [46]:
predictions = text_clf.predict(X_test)

In [47]:
#from unseen data
y_pred = text_clf.predict(unseen_data)

In [48]:
from sklearn import metrics 
print(metrics.confusion_matrix(y_test,predictions))

[[ 135   39   89   15]
 [  21  179  188   37]
 [  27   89 1507  132]
 [   8   12  146  540]]


In [49]:
#Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

          -1       0.71      0.49      0.58       278
           0       0.56      0.42      0.48       425
           1       0.78      0.86      0.82      1755
           2       0.75      0.76      0.76       706

    accuracy                           0.75      3164
   macro avg       0.70      0.63      0.66      3164
weighted avg       0.74      0.75      0.74      3164



In [50]:
#print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7462073324905183


SVM

In [57]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
text_clf = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', 
                             min_df=1, 
                             max_df=0.9, 
                             ngram_range=(1, 2))),
                     ('clf', svm.SVC(decision_function_shape='ovo')),
])
# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.9, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovo'

In [58]:
predictions = text_clf.predict(X_test)

In [59]:
#from unseen data
y_pred = text_clf.predict(unseen_data)

In [60]:
from sklearn import metrics 
print(metrics.confusion_matrix(y_test,predictions))

[[  67   25  179    7]
 [   4  136  264   21]
 [   1   31 1635   88]
 [   1   11  232  462]]


In [61]:
#Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

          -1       0.92      0.24      0.38       278
           0       0.67      0.32      0.43       425
           1       0.71      0.93      0.80      1755
           2       0.80      0.65      0.72       706

    accuracy                           0.73      3164
   macro avg       0.77      0.54      0.58      3164
weighted avg       0.74      0.73      0.70      3164



In [62]:
#print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7269279393173198


# 

In [63]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
text_clf = Pipeline([('tfidf', TfidfVectorizer(stop_words='english', 
                             min_df=1, 
                             max_df=0.9, 
                             ngram_range=(1, 2))),
                     ('clf', SVC(kernel='rbf')),
])
# Feed the training data through the pipeline
text_clf.fit(X_train, y_train) 

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.9, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None,
                     coef0=0.0, decision_function_shape='ovr'

In [64]:
predictions = text_clf.predict(X_test)

In [65]:
#from unseen data
y_pred = text_clf.predict(unseen_data)

In [66]:
from sklearn import metrics 
print(metrics.confusion_matrix(y_test,predictions))

[[  67   25  179    7]
 [   4  136  264   21]
 [   1   31 1635   88]
 [   1   11  232  462]]


In [67]:
#Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

          -1       0.92      0.24      0.38       278
           0       0.67      0.32      0.43       425
           1       0.71      0.93      0.80      1755
           2       0.80      0.65      0.72       706

    accuracy                           0.73      3164
   macro avg       0.77      0.54      0.58      3164
weighted avg       0.74      0.73      0.70      3164



In [68]:
#print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.7269279393173198
