<img style="float: left;" src="pic2.png">

### Sridhar Palle, Ph.D, spalle@emory.edu (Big Data & Data Analytics Program)

# Text Mining Project Solutions: Sentiment Analysis with Yelp reviews

### Sentiment Analysis based on Yelp user reviews

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 1. Data Import

### E.1 Load the data set, store it in the `yelp` variable, and print the first 3 rows

In [2]:
#Your code

In [3]:
yelp = pd.read_csv('yelp_labelled.txt', sep = '\t', header = None) 
yelp.head(3)

| | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |


## 2. Exploration

### E.2 Preliminary exploration

**Rename the columns: column 0 to Review and column 1 to Sentiment. Print the first 3 rows.**

In [4]:
#Your code

In [5]:
yelp.columns = ['Review', 'Sentiment']
yelp.head(3)

| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
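
The load and rename steps can also be combined. The sketch below is one possible variant, not the notebook's own solution: it passes `names=` to `pd.read_csv` so the columns are labelled at read time. The in-memory `sample` string is a stand-in assumed here for `yelp_labelled.txt`, built from the first two rows shown above.

```python
import io

import pandas as pd

# Assumption: a two-line, tab-separated stand-in for yelp_labelled.txt
sample = "Wow... Loved this place.\t1\nCrust is not good.\t0\n"

# header=None says the file has no header row; names= labels the columns directly
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["Review", "Sentiment"],
)
print(df)
```

With the real file, `io.StringIO(sample)` would be replaced by the file path.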


**Name the rows 'R0', 'R1', etc., and show the first three rows of the data frame**

In [6]:
#Your code

In [7]:
rnames = ['R' + str(i) for i in range(yelp.shape[0])]  # 'R0', 'R1', ..., one label per row
yelp.index = rnames
yelp.head(3)

| | Review | Sentiment |
|---|---|---|
| R0 | Wow... Loved this place. | 1 |
| R1 | Crust is not good. | 0 |
| R2 | Not tasty and the texture was just nasty. | 0 |


**Print the shape**

In [8]:
#Your code

In [9]:
yelp.shape

(1000, 2)

**How many positive and negative sentiment reviews are there?**

In [10]:
#Your code

In [11]:
yelp['Sentiment'].value_counts()

1    500
0    500
Name: Sentiment, dtype: int64

## 3. Bag of Words Approach

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

### E.3 Convert the yelp review column to a bag-of-words document

**With default arguments, what is the shape of the bag-of-words document?**

In [13]:
#Your code

In [14]:
cv = CountVectorizer()
bow = cv.fit_transform(yelp['Review'])  # sparse matrix: one row per review, one column per word
print(bow.shape)

(1000, 2035)
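
To make the shape above less abstract, here is a minimal pure-Python sketch of what a bag-of-words matrix is. It is not scikit-learn's implementation. It assumes only the documented `CountVectorizer` defaults (lowercasing and the token pattern `(?u)\b\w\w+\b`, i.e. tokens of two or more word characters) and uses the three reviews printed earlier.

```python
import re
from collections import Counter

# CountVectorizer's documented default token pattern: 2+ word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


docs = [
    "Wow... Loved this place.",
    "Crust is not good.",
    "Not tasty and the texture was just nasty.",
]

# Vocabulary: every distinct token, sorted alphabetically (one column per word)
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count matrix: one row per document, one count per vocabulary word
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append([counts[word] for word in vocab])

print(len(rows), len(vocab))  # number of documents, number of distinct words
```

The full data set works the same way: 1000 rows because there are 1000 reviews, and 2035 columns because that is how many distinct tokens the default settings find.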


**If we remove stop words, what is the shape?**

In [15]:
#Your code

In [16]:
cv = CountVectorizer(stop_words='english')
bow = cv.fit_transform(yelp['Review'])
bow.shape  # English stop words are now excluded from the vocabulary

(1000, 1820)
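
One side effect of stop-word removal is worth seeing on a concrete review. The sketch below filters tokens against scikit-learn's built-in English list (`ENGLISH_STOP_WORDS`); the helper name `content_words` is made up for this illustration and the tokenizer is a simplified stand-in for the vectorizer's own.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Same default token pattern that CountVectorizer documents
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def content_words(text):
    """Hypothetical helper: tokens that survive English stop-word removal."""
    return [t for t in TOKEN.findall(text.lower()) if t not in ENGLISH_STOP_WORDS]


kept = content_words("Crust is not good.")
print(kept)
print("not" in ENGLISH_STOP_WORDS)
```

Because "not" is on the list, "Crust is not good." is reduced to the same tokens as "Crust is good." would be. For sentiment data that may discard useful signal, so it can be worth comparing model accuracy with and without stop-word removal.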

**Remove accents and stop words from the reviews, store the result as our final bow document, and print the shape.**

In [17]:
#Your code

In [18]:
cv = CountVectorizer(stop_words='english', strip_accents='ascii')
bow = cv.fit_transform(yelp['Review'])
bow.shape

(1000, 1818)
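
The vocabulary shrinking from 1820 to 1818 is consistent with a couple of accented spellings collapsing onto plain ASCII forms. The sketch below shows the general idea with the standard library only: decompose characters with NFKD normalization, then drop whatever is not ASCII. The example words are hypothetical and are not taken from the Yelp file.

```python
import unicodedata


def strip_accents_ascii(text):
    """Fold accented characters to their closest ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)  # 'é' -> 'e' + combining accent
    return decomposed.encode("ascii", "ignore").decode("ascii")  # drop the accent marks


# Hypothetical examples: accented and unaccented spellings of the same words
words = ["fiancé", "fiance", "crêpe", "crepe"]
folded = sorted({strip_accents_ascii(w) for w in words})
print(folded)
```

Four distinct spellings become two vocabulary entries, which is the kind of merge that reduces the column count.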

### E.4 Convert the bow document into a dataframe and print the first 3 rows

In [19]:
#Your code

In [20]:
bow_df = pd.DataFrame(bow.toarray()) # to see it as a data frame
bow_df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1808,1809,1810,1811,1812,1813,1814,1815,1816,1817
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### E.5 Assign column names to the dataframe and print the first 3 rows again

In [21]:
#Your code

In [22]:
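# Label columns with the vocabulary words and rows with the review names.
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names(),
# which was removed in scikit-learn 1.2.
bow_df = pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out(), index=rnames)
bow_df.head(3)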

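| | 00 | 10 | 100 | 11 | 12 | 15 | 17 | 1979 | 20 | 2007 | ... | year | years | yellow | yellowtail | yelpers | yucky | yukon | yum | yummy | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| R2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |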


### E.6 List the words that the bow dataframe has in Review 'R5'

In [23]:
#Your code

In [24]:
bow_df.loc['R5', bow_df.loc['R5', :].gt(0)]  # keep only the columns (words) whose count in R5 is > 0

angry      1
damn       1
getting    1
pho        1
want       1
Name: R5, dtype: int64
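
The one-line selection above can be easier to read when split into two steps. This is a small sketch on a made-up count table (the `toy` frame and its values are assumptions for illustration, not rows from the real data): first pull out the row, then keep the entries greater than zero.

```python
import pandas as pd

# Assumption: a tiny stand-in for bow_df with invented counts
toy = pd.DataFrame(
    [[0, 1, 2, 0], [1, 0, 0, 1]],
    columns=["angry", "damn", "pho", "want"],
    index=["R0", "R1"],
)

row = toy.loc["R0"]          # Series of word counts for one review
present = row[row > 0]       # boolean mask keeps only words that occur
words = present.index.tolist()
print(words)
```

`present` keeps the counts as well as the words, so repeated words remain visible.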

## 4. Applying an ML model for classification 

### E.7 Split the data into training and test splits with test size = 0.2, random_state = 0

**Print the shapes of the training and test datasets**

In [25]:
from sklearn.model_selection import train_test_split
#Your code

In [26]:
X_train, X_test, y_train, y_test = train_test_split(bow_df, yelp['Sentiment'], test_size=0.2, random_state=0)
print('Shape of X_train is', X_train.shape)
print('Shape of y_train is', y_train.shape)
print('Shape of X_test is', X_test.shape)
print('Shape of y_test is', y_test.shape)

Shape of X_train is (800, 1818)
Shape of y_train is (800,)
Shape of X_test is (200, 1818)
Shape of y_test is (200,)
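
A plain random split does not guarantee that both classes keep their 50/50 balance: the training confusion matrix in E.9 has row sums of 403 and 397 rather than 400 and 400. If exact balance matters, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variant on toy data (the `X` and `y` lists are invented here), not a change to the exercise's required settings.

```python
from sklearn.model_selection import train_test_split

# Assumption: 10 toy samples with perfectly balanced labels
X = [[i] for i in range(10)]
y = [0, 1] * 5

# stratify=y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```

With the notebook's data, the equivalent call would pass `stratify=yelp['Sentiment']`.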


### E.8 Apply a logistic regression model

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

**Fit the model on the training data (random_state=0)**

In [28]:
#Your code

In [29]:
logr = LogisticRegression(random_state=0)
logr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### E.9 Predictions, confusion matrix and accuracy on Training Data

In [30]:
#Your code

In [31]:
ypred_train = logr.predict(X_train)
print(confusion_matrix(y_train, ypred_train))  # rows: true class, columns: predicted class
print(logr.score(X_train, y_train))            # mean accuracy on the training data

[[392  11]
 [ 16 381]]
0.96625


### E.10 Predictions, confusion matrix and accuracy on Test Data

In [32]:
ypred_test = logr.predict(X_test)
print(confusion_matrix(y_test, ypred_test))  # rows: true class, columns: predicted class
print(logr.score(X_test, y_test))            # mean accuracy on the test data

[[80 17]
 [34 69]]
0.745
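
The accuracy values can be re-derived from the two confusion matrices, which is a useful sanity check on how to read them. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so a 2x2 matrix reads `[[TN, FP], [FN, TP]]`); the `summarize` helper is written for this illustration only.

```python
def summarize(cm):
    """Accuracy, precision and recall for the positive class of a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # share of all reviews classified correctly
        "precision": tp / (tp + fp),     # of predicted positives, how many are right
        "recall": tp / (tp + fn),        # of true positives, how many were found
    }


train = summarize([[392, 11], [16, 381]])  # matrix printed in E.9
test = summarize([[80, 17], [34, 69]])     # matrix printed in E.10

print(train["accuracy"], test["accuracy"])
print(test["precision"], test["recall"])
```

The recomputed accuracies match the `score` outputs (773/800 and 149/200). The gap between training and test accuracy suggests the model fits the training vocabulary more closely than it generalizes, and on the test set the errors lean toward missed positive reviews (34) rather than false positives (17).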
