<a href="https://colab.research.google.com/github/Michwynn/London-Airbnb-Analysis---2/blob/Michael/Michwynn_DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Configuration and library set-up**

In [None]:
# data manipulation

import pandas as pd
import numpy as np 
from collections import Counter, defaultdict

# regex
import re
pattern = r'\w+' # regex pattern (raw string) matching runs of word characters

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords') # comment out if already downloaded
stop_words = set(stopwords.words('english'))
nltk.download('punkt')     # comment out if already downloaded
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# VADER sentiment tagging 
nltk.downloader.download('vader_lexicon')
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# detecting english words
!pip install pycld3
import cld3

# machine learning & statistics
import random
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# progress bars
from tqdm import tqdm

# data visualisation
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# set up working directory
import os
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Airbnb_Milestone2

# suppress warnings
import warnings 
warnings.filterwarnings('ignore')

# Display all columns
pd.set_option('display.max_columns', None)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/.shortcut-targets-by-id/1wUOfFY-ki2nFzneeaTtXLEeMjaSdKrrj/Airbnb_Milestone2


**Read reviews dataset**

In [None]:
reviews_df = pd.read_csv('Datasets/reviews.csv') # read data
reviews_df.columns = reviews_df.columns.str.strip() # remove white spaces in column headings
reviews_df.head(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,52228441,623723762668719111,2022-05-10,37052865,Kimberly,"Great location, and the host was very responsi..."
1,52228441,505671819125096360,2021-11-28,70830110,Mahelet,Duccio is a lovely and friendly host. From arr...
2,52228441,466510411892882382,2021-10-05,83617224,Will,Duccio is a good communicator… he was very hel...
3,52228441,604109461995958546,2022-04-13,2152541,Francesco,Not entirely compliant to the pics.<br/>Good l...
4,605617198416835367,633128504578904919,2022-05-23,45418187,Waddah,Great place and great host


**Check total number of NA values**

In [None]:
print(reviews_df.isna().sum())
print("There are", len(reviews_df), "reviews in the dataset, of which", reviews_df['comments'].isnull().sum(), "have no comment text")

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     2
comments         92
dtype: int64
There are 1216212 reviews in the dataset, of which 92 have no comment text


**Removing all NA values from the dataset**

In [None]:
reviews_df = reviews_df.dropna()

**Detecting language for each review**

In [None]:
%%time
reviews_df['Language'] = reviews_df['comments'].map(lambda x: cld3.get_language(x)[0]) # detecting language of each review 

CPU times: user 6min 48s, sys: 1.18 s, total: 6min 49s
Wall time: 6min 58s


In [None]:
reviews_df.groupby('Language').size().sort_values(ascending = False)
print("Proportion of English reviews:", len(reviews_df[reviews_df.Language == "en"])/len(reviews_df)) # a fraction between 0 and 1, not a percentage

Proportion of English reviews: 0.8727121874686502


**Removing non-English comments**

In [None]:
reviews_df = reviews_df[reviews_df.Language == "en"]

###########################################################################################

***Working on a sample first***

In [None]:
sample_size = 100000
df = reviews_df.head(sample_size).copy() # first sample_size rows as a working sample; copy so new columns can be added safely

###########################################################################################

**Text Data Processing**

In [None]:
%%time
stemmer = PorterStemmer() # create the stemmer once instead of once per review
# NOTE: stem() treats each review as a single token, so this call lowercases the text and can alter at most the end of the review; it does not stem every word
df['comments_cleaned'] = df['comments'].apply(stemmer.stem)
df['comments_cleaned'] = df['comments_cleaned'].str.lower() # lower case all words in review
df['comments_cleaned'] = df['comments_cleaned'].apply(lambda x: re.findall(pattern, x)) # extract all word tokens with the regex pattern
df['comments_cleaned'] = df['comments_cleaned'].apply(lambda x: [item for item in x if item not in stop_words]) # remove stop words using the set built during set-up
df['comments_cleaned_joined'] = df['comments_cleaned'].apply(' '.join) # join tokens back into one string

CPU times: user 6min 7s, sys: 49.1 s, total: 6min 56s
Wall time: 6min 59s
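
As a minimal sketch of the tokenising and stop-word steps above, the snippet below runs them on a single made-up review. The review text, the `clean_review` helper and the tiny `TOY_STOP_WORDS` set are assumptions for illustration only; the notebook itself uses NLTK's full English stop-word list.

```python
import re

TOKEN_PATTERN = r"\w+"  # same pattern as the notebook, written as a raw string

# Hypothetical, deliberately tiny stop-word set (not NLTK's list).
TOY_STOP_WORDS = {"the", "and", "was", "is", "very"}


def clean_review(text):
    """Lowercase a review, split it into word tokens and drop stop words."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    return [token for token in tokens if token not in TOY_STOP_WORDS]


# Made-up review text, not taken from the dataset.
review = "The flat was very clean, and the host is great!"
tokens = clean_review(review)
joined = " ".join(tokens)

print(tokens)  # ['flat', 'clean', 'host', 'great']
print(joined)  # flat clean host great
```

Checking membership against a set, as here, is much cheaper than rebuilding a stop-word list for every token.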


**VADER Sentiment Tagging**

VADER returns a dictionary with four keys: neg, neu, pos and compound.

neg, neu and pos are the proportions of the text rated negative, neutral and positive respectively. They should sum to 1, up to floating-point rounding.

compound is the sum of the valence scores of the words found in the lexicon, normalised to lie between -1 (most extreme negative sentiment) and +1 (most extreme positive sentiment). Unlike the three proportions, it summarises the overall degree of sentiment in a single number, so it is usually enough on its own to classify a text:

*   a positive sentiment, compound ≥ 0.05
*   a negative sentiment, compound ≤ -0.05
*   a neutral sentiment, compound strictly between -0.05 and 0.05
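
To illustrate how an unbounded sum of valence scores ends up between -1 and +1, here is a minimal sketch of the normalisation step. It assumes the formula and the constant `alpha = 15` used by the reference VADER implementation; the `normalize` helper and the raw sums passed to it are hypothetical and not taken from the dataset.

```python
import math

# Assumption: default normalisation constant of the reference VADER code.
ALPHA = 15


def normalize(raw_score, alpha=ALPHA):
    """Squash a summed valence score into the interval [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Clamp defensively so rounding can never push the value out of range.
    return max(-1.0, min(1.0, norm))


# Hypothetical raw valence sums for four imaginary reviews.
for raw in (-100.0, -3.0, 0.0, 3.0):
    print(raw, round(normalize(raw), 4))
```

Larger absolute sums move the compound score closer to -1 or +1 without ever reaching them, which is why fixed cut-offs such as ±0.05 can be applied to reviews of any length.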


In [None]:
analyzer = SentimentIntensityAnalyzer() # load the VADER lexicon once rather than once per review
df['sentiment_score'] = [round(analyzer.polarity_scores(x)['compound'], 2) for x in df['comments_cleaned_joined']]

In [None]:
# Tagging Sentiment based on Sentiment Score

# creating function to tag sentiment based on sentiment score

def sentiment_tag(row):
  if row['sentiment_score'] >= 0.05:
    sentiment = "Positive"
  elif row['sentiment_score'] <= -0.05:
    sentiment = "Negative"
  else:
    sentiment = "Neutral"
  return sentiment

# create new col in pandas df
df['sentiment'] = df.apply(sentiment_tag, axis = 1)

**Sentiment - Exploratory Data Analysis (EDA)**

In [None]:
df.groupby('sentiment')['sentiment_score'].agg(['count','mean']).reset_index()

Unnamed: 0,sentiment,count,mean
0,Negative,857,-0.471727
1,Neutral,1886,-4.8e-05
2,Positive,97257,0.851398


**Proportion of sentiment in sample: Proxy**

In [None]:
for sentiment in df.sentiment.unique():
  print("% of "+ str(sentiment) + " in sample df:", len(df[df.sentiment == sentiment])/len(df)*100)

% of Positive in sample df: 97.257
% of Neutral in sample df: 1.886
% of Negative in sample df: 0.857
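
The percentages can be cross-checked against the counts in the aggregation table above. The sketch below is only that arithmetic; the `counts` dictionary is copied by hand from the table rather than computed from `df`.

```python
# Counts per sentiment label, copied from the aggregation table above.
counts = {"Negative": 857, "Neutral": 1886, "Positive": 97257}

total = sum(counts.values())
shares = {label: n / total * 100 for label, n in counts.items()}

# Print the labels from most to least frequent.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {share:.3f}%")
```

The imbalance matters for any later modelling: always predicting Positive would already match about 97% of these labels.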
