## Udacity Write A Data Science Blog Post Project

## Introduction

This project is part of the [Udacity](https://eu.udacity.com/) Data Scientist Nanodegree Program, which is composed of:
- Term 1
    - Supervised Learning
    - Deep Learning
    - Unsupervised Learning
- Term 2
    - Write A Data Science Blog Post
    - 
    -
    
The CRISP-DM process (Cross-Industry Standard Process for Data Mining):
1. Business Understanding
2. Data Understanding
3. Prepare Data
4. Data Modeling
5. Evaluate the Results
6. Deploy   

The goal of this project is to put into practice the technical skills taught during the program, but mainly to focus on the ability to communicate the results of the analysis effectively.

### Software and Libraries
This project uses Python 3.7.2 and the following libraries:
- NumPy
- pandas
- scikit-learn
- Matplotlib
- seaborn
- NLTK
- TextBlob

## Business Understanding

Looking at the suggested datasets, I thought I wanted to do something useful. Some friends and I were pondering the idea of moving to Milan to be closer to our workplaces. Milan is a really beautiful place, but it is also very expensive.

Questions:
- Which are the 5 best neighborhoods?
- Which are the 5 worst neighborhoods?
- How different are the descriptions of the neighborhoods given by the hosts and by the guests?

So I decided to use the Airbnb dataset for the city of Milan to do a sentiment analysis of the neighborhoods.

## Data Understanding

As already mentioned, the dataset is provided by [Airbnb](http://insideairbnb.com/get-the-data.html) and is basically composed of:
- listings.csv:	Detailed Listings data for Milan
- calendar.csv:	Detailed Calendar Data for listings in Milan
- reviews.csv:	Detailed Review Data for listings in Milan
- summary_listings.csv:	Summary information and metrics for listings in Milan (good for visualisations).
- summary_reviews.csv: Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing).
- neighbourhoods.csv: Neighbourhood list for geo filter. Sourced from city or open source GIS files.
- neighbourhoods.geojson: GeoJSON file of neighbourhoods of the city.

In [None]:
# Import libraries necessary for this project
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from time import time
#from collections import Counter

# The NLTK stopwords corpus must be available, e.g. via nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

from textblob import TextBlob

# Pretty display for notebooks
%matplotlib inline

data_folder = 'airbnb_data/'

# Load the datasets
df_listings_data = pd.read_csv(data_folder + 'listings.csv')
df_calendar_data = pd.read_csv(data_folder + 'calendar.csv')
df_reviews_data = pd.read_csv(data_folder + 'reviews.csv')
#df_summary_listings_data = pd.read_csv(data_folder + 'summary_listings.csv')
#df_summary_reviews_data = pd.read_csv(data_folder + 'summary_reviews.csv')

In [None]:
df_listings_data.head()

In [None]:
df_listings_data.columns

In [None]:
df_listings_data.info()

In [None]:
df_listings_data.describe()

In [None]:
print("Numerical variables:")

# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
for name, values in df_listings_data.items():
    if(values.dtype == np.float64 or values.dtype == np.int64):
        print(name)

In [None]:
print("Categorical variables:")

for name, values in df_listings_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print(name)

In [None]:
# Print the unique values of each categorical variable
for name, values in df_listings_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print('{name}: {value}\n'.format(name=name, value=values.unique()))
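
The split between numerical and categorical columns can also be sketched with pandas `DataFrame.select_dtypes`, which selects columns by dtype without an explicit loop. The toy dataframe below is invented for the example and only stands in for `df_listings_data`.

```python
import pandas as pd

# Invented toy data standing in for df_listings_data
df_toy = pd.DataFrame({'id': [1, 2], 'price': [10.5, 20.0], 'name': ['a', 'b']})

# 'number' selects every numeric dtype (integers and floats)
numerical_columns = list(df_toy.select_dtypes(include='number').columns)
categorical_columns = list(df_toy.select_dtypes(exclude='number').columns)

print(numerical_columns)
print(categorical_columns)
```

Unlike the explicit check on `np.float64` and `np.int64`, this also treats other numeric widths such as `int32` as numerical.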

In [None]:
df_calendar_data.head()

In [None]:
df_calendar_data.columns

In [None]:
df_calendar_data.info()

In [None]:
df_calendar_data.describe()

In [None]:
print("Numerical variables:")

# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
for name, values in df_calendar_data.items():
    if(values.dtype == np.float64 or values.dtype == np.int64):
        print(name)

In [None]:
print("Categorical variables:")

for name, values in df_calendar_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print(name)

In [None]:
# Print the unique values of each categorical variable
for name, values in df_calendar_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print('{name}: {value}\n'.format(name=name, value=values.unique()))

In [None]:
df_reviews_data.head()

In [None]:
df_reviews_data.columns

In [None]:
df_reviews_data.info()

In [None]:
df_reviews_data.describe()

In [None]:
print("Numerical variables:")

# DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
for name, values in df_reviews_data.items():
    if(values.dtype == np.float64 or values.dtype == np.int64):
        print(name)

In [None]:
print("Categorical variables:")

for name, values in df_reviews_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print(name)

In [None]:
# Print the unique values of each categorical variable
for name, values in df_reviews_data.items():
    if(values.dtype != np.float64 and values.dtype != np.int64):
        print('{name}: {value}\n'.format(name=name, value=values.unique()))

## Data Preparation and Data Modeling

Looking at the columns of the datasets, we can figure out which of them are useful to answer our questions. For our goal, the main focus is on the neighbourhoods:

In [None]:
df_listings_data_cleaned = df_listings_data[['id'
                                             #, 'name'
                                             #, 'summary'
                                             #, 'space'
                                             #, 'description'
                                             , 'neighborhood_overview'
                                             #, 'transit'
                                             #, 'access'
                                             #, 'interaction'
                                             #, 'house_rules'
                                             #, 'host_about'
                                             #, 'host_neighbourhood'
                                             #, 'neighbourhood'
                                             , 'neighbourhood_cleansed']].copy()  # copy to avoid SettingWithCopyWarning on later assignments

df_listings_data_cleaned.head()

In [None]:
df_listings_data_cleaned.shape[0]

In [None]:
len(df_listings_data_cleaned['id'].unique())

In [None]:
# Set 'id' as key in the dataframe
df_listings_data_cleaned.set_index('id', inplace = True)

In [None]:
df_listings_data_cleaned.head()

In [None]:
# Heatmap of the missing values listings_data_cleaned

plt.figure(figsize = (20, 20))
sns.heatmap(df_listings_data_cleaned.isnull(), cmap = 'Blues', cbar = False)

In [None]:
len(df_listings_data['neighbourhood_cleansed'].unique())

There are only 85 unique **neighbourhood_cleansed** entries.

In [None]:
neighbourhoods = df_listings_data['neighbourhood_cleansed'].unique()
neighbourhoods.sort()

for neighbourhood in neighbourhoods: 
    print(neighbourhood.lower())

After searching online for [Milan's neighbourhoods](http://www.museomilano.it/mediateca/media-pg-5/) and doing some data cleaning, we get this list of 130 neighbourhoods:
- ticinese
- magenta
- porta vercellina
- cordusio
- carrobbio
- cinquevie
- sant’ambrogio
- verziere
- san babila
- brolo-pantano
- duomo
- castello
- sempione
- brera
- borgo degli ortolani - chinatown
- porta nuova
- centrale
- centro direzionale
- porta garibaldi
- porta venezia
- risorgimento
- porta vittoria
- porta romana
- citta’ studi
- acquabella
- porta monforte
- calvairate
- lazio
- tertulliano
- porta vigentina
- porta genova
- porta lodovica
- bullona
- taliedo mecenate
- morsenchio
- gamboloita
- castagnedo
- vigentino
- corvetto
- nosedo
- santa giulia
- rogoredo
- triulzo superiore
- ponte lambro
- forlanini
- monluè
- guastalla
- ortica
- cavriano
- lambrate
- loreto
- abadesse
- ponte seveso
- isola
- tortona
- washington
- solari
- navigli
- san pietro
- la maddalena
- pagano
- fopponino
- lotto
- molinazzo
- vaiano valle
- selvanesco
- moncucco
- san cristoforo
- lorenteggio giambellino
- primaticcio 
- arzaga
- forze armate
- bisceglie
- quarto cagnino
- quinto romano
- baggio
- muggiano
- trenno
- figino
- lampugnano
- gallaratese
- cascina merlata
- certosa
- qt8
- san siro
- portello
- cagnola
- musocco
- roserio
- vialba
- ronchetto sul naviglio
- barona
- boffalora
- chiesa rossa
- conca fallata
- cantalupa
- gratosoglio
- macconago
- quintosole
- morivione
- chiaravalle
- casoretto
- greco
- bicocca
- prato centenario
- gorla
- precotto
- villa san giovanni
- adriano
- crescenzago
- rottole
- turro
- maggiolina
- montalbino
- niguarda
- tre torri
- dergano
- affori 
- bovisasca
- comasina
- bruzzano
- bovisa 
- villa pizzone
- quarto oggiaro
- farini 
- la fontana
- ronchetto delle rane
- conchetta
- porta volta
- ghisolfa

![title](img/quartieri_milano.jpg)

As we can see, not all the neighbourhoods are represented in the dataset and, moreover, there is no exact mapping between the dataset and the real neighbourhoods.

In [None]:
list_real_neighbourhood = []

with open('quartieri.txt', 'r') as file:  
    for line in file:
        item = line.replace('\n','') # remove linebreak
        list_real_neighbourhood.append(item)
        
print(len(list_real_neighbourhood))
#print(list_real_neighbourhood)

In [None]:
def is_present(item, lista):
    for elemento in lista:
        if elemento in item or item in elemento:
            return elemento
    return False

lista = ['pippo', 'pluto', 'paperino']
print(is_present('paperino', lista))

In [None]:
list_mapping_neighbourhood_real_neighbourhood = []
list_no_matched_data_neighbourhood_by_real_neighbourhood = []
list_no_matched_real_neighbourhood_by_data_neighbourhood = []

for neighbourhood in neighbourhoods:
    neighbourhood = neighbourhood.lower()
    real_neighbourhood = is_present(neighbourhood, list_real_neighbourhood)
    if real_neighbourhood == False:
        list_no_matched_data_neighbourhood_by_real_neighbourhood.append(neighbourhood)
    else:
        list_mapping_neighbourhood_real_neighbourhood.append((neighbourhood, real_neighbourhood))       

In [None]:
list_mapping_neighbourhood_real_neighbourhood

By checking the associations made by our function, we can see some errors that we must correct:

In [None]:
# Fix the wrong associations ('ronchetto sul naviglio', 'navigli') and ('bovisa', 'bovisasca')

list_mapping_neighbourhood_real_neighbourhood_correct = []

for i in range(len(list_mapping_neighbourhood_real_neighbourhood)):
    if list_mapping_neighbourhood_real_neighbourhood[i][0] == 'ronchetto sul naviglio':
        list_mapping_neighbourhood_real_neighbourhood_correct.append(('ronchetto sul naviglio', 'ronchetto sul naviglio'))
    elif list_mapping_neighbourhood_real_neighbourhood[i][0] == 'bovisa':
        list_mapping_neighbourhood_real_neighbourhood_correct.append(('bovisa', 'bovisa'))
    else:
        list_mapping_neighbourhood_real_neighbourhood_correct.append(list_mapping_neighbourhood_real_neighbourhood[i])
        
list_mapping_neighbourhood_real_neighbourhood = list_mapping_neighbourhood_real_neighbourhood_correct
list_mapping_neighbourhood_real_neighbourhood
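
The wrong associations come from the substring test in `is_present`, which returns the first list element that contains, or is contained in, the data name. One possible way to reduce such false positives, sketched here under the assumption that an exact match should always win, is to look for an exact match first and fall back to substring containment only when none exists. The helper name `match_neighbourhood` and the short `candidates` list are hypothetical and only illustrate the idea.

```python
def match_neighbourhood(item, candidates):
    """Return an exact match if present, else the first substring match, else None."""
    # An exact match is unambiguous, so it takes priority
    if item in candidates:
        return item
    # Fall back to the substring test used by is_present
    for candidate in candidates:
        if candidate in item or item in candidate:
            return candidate
    return None

# Hypothetical excerpt of the real neighbourhood list, ordered so that
# the shorter, misleading names come first
candidates = ['navigli', 'bovisasca', 'ronchetto sul naviglio', 'bovisa']

print(match_neighbourhood('ronchetto sul naviglio', candidates))
print(match_neighbourhood('bovisa', candidates))
print(match_neighbourhood('qt 8', candidates))
```

Names with no exact match can still be paired wrongly by the fallback, so a manual check of the result remains necessary.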

In [None]:
for neighbourhood in list_real_neighbourhood:
    if neighbourhood not in [element[1] for element in list_mapping_neighbourhood_real_neighbourhood]:
        list_no_matched_real_neighbourhood_by_data_neighbourhood.append(neighbourhood)
        
print(len(list_mapping_neighbourhood_real_neighbourhood))
print(len(list_no_matched_data_neighbourhood_by_real_neighbourhood))
print(len(list_no_matched_real_neighbourhood_by_data_neighbourhood))

In [None]:
print(list_no_matched_data_neighbourhood_by_real_neighbourhood)

In [None]:
print(list_no_matched_real_neighbourhood_by_data_neighbourhood)

Let's map the unmatched neighbourhoods by hand:

|     Data               |     Real                         |
|------------------------|----------------------------------| 
| bande nere             | primaticcio                      |
| buenos aires - venezia | porta venezia                    |
| corsica                | acquabella                       |
| de angeli - monte rosa | tre torri                        |
| garibaldi repubblica   | porta garibaldi                  |
| ortomercato            | calvairate                       |
| padova                 | isola                            |
| parco bosco in città   | quinto romano                    |
| parco delle abbazie    | vaiano valle                     |
| parco lambro - cimiano | lambrate                         |
| parco nord             | bicocca                          |
| qt 8                   | qt8                              |
| ripamonti              | vigentino                        |
| s. cristoforo          | san cristoforo                   |
| s. siro                | san siro                         |
| sacco                  | vialba                           |
| sarpi                  | borgo degli ortolani - chinatown |
| scalo romana           | vigentino                        |
| selinunte              | san siro                         |
| stadera                | chiesa rossa                     |
| tibaldi                | conchetta                        |
| umbria - molise        | calvairate                       |
| viale monza            | gorla                            |
| villapizzone           | villa pizzone                    |
| xxii marzo             | porta vittoria                   |

In [None]:
list_manual_mapping_neighbourhood_real_neighbourhood = [ ('bande nere', 'primaticcio')
                                                        , ('buenos aires - venezia', 'porta venezia')
                                                        , ('corsica', 'acquabella')
                                                        , ('de angeli - monte rosa', 'tre torri')
                                                        , ('garibaldi repubblica', 'porta garibaldi')
                                                        , ('ortomercato', 'calvairate')
                                                        , ('padova', 'isola')
                                                        , ('parco bosco in citt\x85', 'quinto romano')
                                                        , ('parco delle abbazie', 'vaiano valle')
                                                        , ('parco lambro - cimiano', 'lambrate')
                                                        , ('parco nord', 'bicocca')
                                                        , ('qt 8', 'qt8')
                                                        , ('ripamonti', 'vigentino')
                                                        , ('s. cristoforo', 'san cristoforo')
                                                        , ('s. siro', 'san siro')
                                                        , ('sacco', 'vialba')
                                                        , ('sarpi', 'borgo degli ortolani - chinatown')
                                                        , ('scalo romana', 'vigentino')
                                                        , ('selinunte', 'san siro')
                                                        , ('stadera', 'chiesa rossa')
                                                        , ('tibaldi', 'conchetta')
                                                        , ('umbria - molise', 'calvairate')
                                                        , ('viale monza', 'gorla')
                                                        , ('villapizzone', 'villa pizzone')
                                                        , ('xxii marzo', 'porta vittoria')
                                                       ]

for tupla in list_manual_mapping_neighbourhood_real_neighbourhood:
    list_mapping_neighbourhood_real_neighbourhood.append(tupla)
    list_no_matched_data_neighbourhood_by_real_neighbourhood.remove(tupla[0])
    if tupla[1] in list_no_matched_real_neighbourhood_by_data_neighbourhood:
        list_no_matched_real_neighbourhood_by_data_neighbourhood.remove(tupla[1])
    
print(len(list_mapping_neighbourhood_real_neighbourhood))
print(len(list_no_matched_data_neighbourhood_by_real_neighbourhood))
print(len(list_no_matched_real_neighbourhood_by_data_neighbourhood))

In [None]:
list_mapping_neighbourhood_real_neighbourhood

In [None]:
# Neighbourhoods not represented in the dataset
print(list_no_matched_real_neighbourhood_by_data_neighbourhood)

Now let's map **neighbourhood_cleansed** in the dataframe to the real neighbourhoods:

In [None]:
def get_real_neighbourhood(data_neighbourhood, list_mapping):
    for tupla in list_mapping:
        if tupla[0] == data_neighbourhood:
            return tupla[1]
    return False

print(get_real_neighbourhood('villapizzone', list_mapping_neighbourhood_real_neighbourhood))

In [None]:
# Map neighbourhood to real neighbourhood
df_listings_data_cleaned['real_neighbourhood'] = [get_real_neighbourhood(neighbourhood.lower(), list_mapping_neighbourhood_real_neighbourhood) for neighbourhood in df_listings_data_cleaned['neighbourhood_cleansed']]

# Drop 'neighbourhood_cleansed' column
df_listings_data_cleaned = df_listings_data_cleaned.drop(columns=['neighbourhood_cleansed'])

In [None]:
df_listings_data_cleaned.head()
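
A list of tuples works, but every lookup scans the list from the start. As a sketch, assuming each data name maps to exactly one real neighbourhood, the same pairs can be held in a dict and applied with pandas `Series.map`. Names missing from the dict become `NaN` instead of `False`. The three-entry mapping and the toy dataframe are illustrative only.

```python
import pandas as pd

# Illustrative subset of the (data name, real name) pairs from the table above
mapping_pairs = [('qt 8', 'qt8'),
                 ('s. siro', 'san siro'),
                 ('sarpi', 'borgo degli ortolani - chinatown')]
mapping = dict(mapping_pairs)

# Toy dataframe standing in for df_listings_data_cleaned
df_toy = pd.DataFrame({'neighbourhood_cleansed': ['QT 8', 'S. Siro', 'Unknown']})

# Lower-case the data names, then look each one up in the dict
df_toy['real_neighbourhood'] = df_toy['neighbourhood_cleansed'].str.lower().map(mapping)

print(df_toy)
```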

Now let's detect the language of **neighborhood_overview**:

In [None]:
text_blob = TextBlob('la casa è brutta')

print(text_blob.detect_language())

In [None]:
def detect_language(text):
    try:
        return TextBlob(text).detect_language()
    except Exception:
        # Non-string input (e.g. NaN) or a failed detection
        return 'not detected'
    
print(detect_language(22))

In [None]:
list_neighborhood_overview_detected_language = []

for neighborhood_overview in df_listings_data_cleaned['neighborhood_overview']:
    list_neighborhood_overview_detected_language.append(detect_language(neighborhood_overview))

In [None]:
print(len(list_neighborhood_overview_detected_language))

In [None]:
# Plot a bar chart of the neighborhood_overview detected languages

list_language_count = []

for language in set(list_neighborhood_overview_detected_language): 
    list_language_count.append((language, list_neighborhood_overview_detected_language.count(language)))

#print(list_language_count)

languages = [language_count[0] for language_count in list_language_count]
counts = [language_count[1] for language_count in list_language_count]

plt.figure(figsize = (20, 10)) 
plt.bar(languages, counts)
plt.show()

In [None]:
total_number_review = df_listings_data_cleaned.shape[0]

#print(total_number_review)

for language, count in list_language_count:
    print('{0:<15} {1:>8}'.format(language, count / total_number_review))
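
The counting loop above calls `list.count` once per language, which rescans the whole list each time. A minimal sketch with `collections.Counter` from the standard library gives the same counts in a single pass. The `detected` list is made-up sample data, not output from the dataset.

```python
from collections import Counter

# Made-up detection results standing in for list_neighborhood_overview_detected_language
detected = ['it', 'en', 'it', 'not detected', 'en', 'it']

language_counts = Counter(detected)
total = len(detected)

# most_common() sorts the languages from most to least frequent
for language, count in language_counts.most_common():
    print('{0:<15} {1:>8.2%}'.format(language, count / total))
```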

In [None]:
# Mark each neighborhood_overview with the detected language
# (reuse the list computed above instead of running the detection again)

df_listings_data_cleaned['detected_language'] = list_neighborhood_overview_detected_language

df_listings_data_cleaned.head()

The same analysis must be done on the reviews:

In [None]:
df_reviews_data_cleaned = df_reviews_data[['listing_id'
                                           #, 'date'
                                           , 'comments']].copy()  # copy to avoid SettingWithCopyWarning on later assignments

df_reviews_data_cleaned.head()

In [None]:
# Heatmap of the missing values in reviews_data_cleaned

# Not useful here because there are no missing values

plt.figure(figsize = (20, 20))
sns.heatmap(df_reviews_data_cleaned.isnull(), cmap = 'Blues', cbar = False)

In [None]:
list_comment_detected_language = []

for comment in df_reviews_data_cleaned['comments']:
    list_comment_detected_language.append(detect_language(comment))

In [None]:
print(len(list_comment_detected_language))

In [None]:
# Plot a bar chart of the comment detected languages

list_language_count = []

for language in set(list_comment_detected_language): 
    list_language_count.append((language, list_comment_detected_language.count(language)))

#print(list_language_count)

languages = [language_count[0] for language_count in list_language_count]
counts = [language_count[1] for language_count in list_language_count]

plt.figure(figsize = (20, 10)) 
plt.bar(languages, counts)
plt.show()

In [None]:
total_number_review = df_reviews_data_cleaned.shape[0]

#print(total_number_review)

for language, count in list_language_count:
    print('{0:<15} {1:>8}'.format(language, count / total_number_review))

In [None]:
# Mark each review with the detected language
# (reuse the list computed above instead of running the detection again)

df_reviews_data_cleaned['detected_language'] = list_comment_detected_language

df_reviews_data_cleaned.head()

In [None]:
# Save dataframe

df_listings_data_cleaned.to_csv(data_folder + 'output_' + 'listings.csv')
df_reviews_data_cleaned.to_csv(data_folder + 'output_' + 'reviews.csv')

#df_listings_data_cleaned = pd.read_csv(data_folder + 'output_' + 'listings.csv')
#df_reviews_data_cleaned = pd.read_csv(data_folder + 'output_' + 'reviews.csv')

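From a cursory look at the reviews, they appear to be written in several languages. The cell below redefines `detect_language` with a stopword-based approach built on NLTK, which returns full language names such as `english`: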

In [None]:
# Adapted from http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/

def calculate_languages_ratios(text):
    """
    Score the given text against several languages and return a dictionary
    that looks like {'french': 2, 'spanish': 4, 'english': 0}
    
    @param text: Text whose language is to be detected
    @type text: str
    
    @return: Dictionary mapping each language to the number of unique stopwords seen in the text
    @rtype: dict
    """

    languages_ratios = {}
    
    tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = tokenizer.tokenize(text)
    words = [word.lower() for word in word_tokens]

    # Compute per language included in nltk number of unique stopwords appearing in analyzed text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements) # language "score"

    return languages_ratios


def detect_language(text):
    """
    Score the given text against several languages and
    return the highest scored one.
    
    It uses a stopwords based approach, counting how many unique stopwords
    are seen in the analyzed text.
    
    @param text: Text whose language is to be detected
    @type text: str
    
    @return: Highest scored language
    @rtype: str
    """
    
    try:
        ratios = calculate_languages_ratios(text)
        most_rated_language = max(ratios, key = ratios.get)
    except Exception:
        # Non-string input (e.g. NaN) cannot be tokenized
        most_rated_language = 'not detected'
        
    return most_rated_language


#input_text = "This is a sample sentence, showing off the language detection"
input_text = "Questa è una frase in italiano"

print(detect_language(input_text))
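
To make the scoring idea concrete without downloading the NLTK corpora, the sketch below applies the same "count the unique stopwords" rule to two tiny hand-written stopword sets. The sets and the function name `score_languages` are assumptions made for illustration; the stopword lists shipped with NLTK are much longer.

```python
import re

# Tiny hand-written stopword sets (illustrative, not the NLTK lists)
TOY_STOPWORDS = {
    'english': {'the', 'is', 'a', 'this', 'in', 'of'},
    'italian': {'una', 'è', 'in', 'il', 'la', 'questa'},
}

def score_languages(text):
    """Count, per language, how many unique stopwords appear in the text."""
    words = set(re.findall(r'\w+', text.lower()))
    return {language: len(words & stop_set)
            for language, stop_set in TOY_STOPWORDS.items()}

scores = score_languages('Questa è una frase in italiano')
print(scores)
print(max(scores, key=scores.get))
```

Words shared by several languages, such as `in` here, add to more than one score, which is why short texts are the hardest to classify with this approach.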

In [None]:
list_review_languages = []
list_language_count = []

tokenizer = RegexpTokenizer(r'\w+')

for comment in df_reviews_data_cleaned['comments']:
    list_review_languages.append(detect_language(comment))

for language in set(list_review_languages): 
    list_language_count.append((language, list_review_languages.count(language)))

#print(list_review_languages)
#print(list_language_count)

languages = [language_count[0] for language_count in list_language_count]
counts = [language_count[1] for language_count in list_language_count]

plt.figure(figsize = (20, 10)) 
plt.bar(languages, counts)
plt.show()

In [None]:
# Mark each review with the detected language
# (reuse the list computed above instead of running the detection again)

df_reviews_data_cleaned['detected_language'] = list_review_languages

df_reviews_data_cleaned.head()

In [None]:
# Save dataframe

df_listings_data_cleaned.to_csv(data_folder + 'output_' + 'listings.csv')
df_reviews_data_cleaned.to_csv(data_folder + 'output_' + 'reviews.csv')

#df_listings_data_cleaned = pd.read_csv(data_folder + 'output_' + 'listings.csv')
#df_reviews_data_cleaned = pd.read_csv(data_folder + 'output_' + 'reviews.csv')

In [None]:
total_number_review = df_reviews_data_cleaned.shape[0]

#print(total_number_review)

for language, count in list_language_count:
    print('{0:<15} {1:>8}'.format(language, count / total_number_review))

For the sake of simplicity, we can focus on Italian and English and maybe later extend our research to other languages.

Synonyms of neighborhood (*quartiere* in Italian):
- [`English`](https://www.thesaurus.com/browse/neighborhood):
 - area
 - block
 - district
 - ghetto
 - parish
 - part
 - precinct
 - region
 - section
 - slum
 - street
 - suburb
 - territory
 - zone
- [`Italian`](https://dizionari.corriere.it/dizionario_sinonimi_contrari/Q/quartiere.shtml):
 - zona
 - vicinato
 - rione
 - sobborgo
 - borgata

First, let's try with English:

In [None]:
df_reviews_data_cleaned_eng = df_reviews_data_cleaned[df_reviews_data_cleaned['detected_language'] == 'english'].copy()  # copy to avoid SettingWithCopyWarning

df_reviews_data_cleaned_eng.head()

In [None]:
print(len(df_reviews_data_cleaned_eng))

Now let's search the reviews for the keywords related to the neighborhood:

In [None]:
searched_words_english = [ 'neighborhood'
                          , 'area'
                          , 'block'
                          , 'district'
                          , 'ghetto'
                          , 'parish'
                          #, 'part' # removed because it was mostly used in parts of comments not related to the neighborhood
                          , 'precinct'
                          , 'region'
                          , 'section'
                          , 'slum'
                          , 'street'
                          , 'suburb'
                          , 'territory'
                          , 'zone'
                          , 'location'
                          ]

#searched_words_italian = ['quartiere'
#                          , 'zona'
#                          , 'vicinato'
#                          , 'rione'
#                          , 'sobborgo'
#                          , 'borgata'
#                          ]

#searched_words  = searched_words_english + searched_words_italian

In [None]:
def detect_words(text, searched_words):
    try:
        for word in searched_words:
            if word in text:
                return True
    except TypeError:  # text is neither a string nor a list (e.g. False or NaN)
        return False
    return False

print(detect_words('questo testo continene pippo', ['pippo', 'pluto']))

In [None]:
#df_reviews_data_cleaned_eng[['comments', 'detected_language']]

In [None]:
def clean_tokenize_text(text, language):
    try:
        tokenizer = RegexpTokenizer(r'\w+')
        word_tokens = tokenizer.tokenize(text)
        stop_words = set(stopwords.words(language)) 
        filtered_sentence = [word for word in word_tokens if word not in stop_words] 
        return filtered_sentence
    except Exception:
        # Non-string text or a language without an NLTK stopword list
        return False
    
input_text = "This is a sample sentence, showing off the stop words filtration!!!"
print(clean_tokenize_text(input_text, 'english'))
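
Whether `detect_words` receives a string or a list of tokens changes its behaviour. On a string, `in` is a substring test, so `'part'` matches `'apartment'`. On a token list it is an exact, case-sensitive comparison, so `'area'` does not match `'Area'`. The sketch below, with the hypothetical helper `contains_keyword`, lower-cases the tokens and intersects them with the keyword set so that only whole words match, regardless of case.

```python
import re

def contains_keyword(text, keywords):
    """Return True if any keyword appears in the text as a whole word, ignoring case."""
    tokens = set(re.findall(r'\w+', text.lower()))
    return not tokens.isdisjoint(keywords)

keywords = {'area', 'part', 'district'}

# Substring test on a raw string: 'part' is found inside 'apartment'
print('part' in 'The apartment is lovely')
# Whole-word test: 'apartment' is not the keyword 'part'
print(contains_keyword('The apartment is lovely', keywords))
# Case-insensitive: 'Area' matches the keyword 'area'
print(contains_keyword('Area is quiet at night', keywords))
```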

Let's check the first comment:

In [None]:
# Tokenize and clean comments
comment, language = df_reviews_data_cleaned_eng[['comments','detected_language']].iloc[0]

print(comment, language)

In [None]:
print(clean_tokenize_text(comment, language))

In [None]:
print(detect_words(clean_tokenize_text(comment, language), searched_words_english))

In [None]:
# Mark each review if searched words are present

list_key_serached_words_present = []

for i in range (df_reviews_data_cleaned_eng.shape[0]):
    comment, language = df_reviews_data_cleaned_eng[['comments','detected_language']].iloc[i]
    list_key_serached_words_present.append((i, detect_words(clean_tokenize_text(comment, language), searched_words_english)))

In [None]:
print(len(list_key_serached_words_present))

In [None]:
serached_words_present = [key_serached_words_present[1] for key_serached_words_present in list_key_serached_words_present]

print(sum(serached_words_present))

In [None]:
print(sum(serached_words_present) / df_reviews_data_cleaned_eng.shape[0])

Unfortunately, only 18% of the English reviews contain words related to the neighborhood.

In [None]:
df_reviews_data_cleaned_eng['contains_serached_words'] = [key_serached_words_present[1] for key_serached_words_present in list_key_serached_words_present]

df_reviews_data_cleaned_eng.head()

In [None]:
df_reviews_data_cleaned['contains_serached_words'] = [detect_words(clean_tokenize_text(row['comments'], row['detected_language']), searched_words_english) for index, row in df_reviews_data_cleaned.iterrows()]

df_reviews_data_cleaned_eng.head()

In [None]:
df_reviews_data_cleaned_eng_contains = df_reviews_data_cleaned_eng[df_reviews_data_cleaned_eng['contains_serached_words'] == True].copy()  # copy to avoid SettingWithCopyWarning

df_reviews_data_cleaned_eng_contains.head()

Now the goal is to isolate the words related to neighborhood or similar:

In [None]:
comment = df_reviews_data_cleaned_eng_contains['comments'].iloc[0]

print(comment)

By looking at some comments, I realized I could use punctuation to isolate the phrases related to the neighborhood instead of removing it, as I was thinking at the beginning.

In [None]:
def get_contextual_phrase(text, language, searched_words):
    contextual_phrase = ''
    sentences = text.split('.')
    for sentence in sentences:
        if detect_words(clean_tokenize_text(sentence, language), searched_words) == True:
            contextual_phrase = contextual_phrase + ' ' + sentence
    if contextual_phrase == '':
        return text
    else:
        return contextual_phrase

text = "Staying at Francesca's and Alberto's place was a pleasure. Just as described, well located for my purposes, an enjoyable walk to the Tortona area. The room is very nice, cleaned daily and has private bathroom.Francesca is super friendly and very helpful; whilst still respecting privacy. Overall a great experience!"

print(get_contextual_phrase(text, 'english', searched_words_english))
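
Splitting on `'.'` alone leaves sentences that end with `!` or `?` glued to the sentence that follows. A possible refinement, sketched here with the standard `re` module and a hypothetical `split_sentences` helper, splits on any run of those three characters and drops the empty fragments.

```python
import re

def split_sentences(text):
    """Split text on runs of '.', '!' or '?' and strip surrounding whitespace."""
    fragments = re.split(r'[.!?]+', text)
    return [fragment.strip() for fragment in fragments if fragment.strip()]

sample = "Great location! The area is quiet.Would we return? Yes."
for sentence in split_sentences(sample):
    print(sentence)
```

This is still a rough heuristic: abbreviations and decimal numbers containing a period are split as well.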

In [None]:
list_key_contextual_phrase = []

for i in range (df_reviews_data_cleaned_eng_contains.shape[0]):
    comment, language = df_reviews_data_cleaned_eng_contains[['comments','detected_language']].iloc[i]
    list_key_contextual_phrase.append((i, get_contextual_phrase(comment, language, searched_words_english)))

In [None]:
print(len(list_key_contextual_phrase))

In [None]:
df_reviews_data_cleaned_eng_contains['contextual_phrase'] = [key_contextual_phrase[1] for key_contextual_phrase in list_key_contextual_phrase]

df_reviews_data_cleaned_eng_contains.head()

In [None]:
comment = df_reviews_data_cleaned_eng_contains['contextual_phrase'].iloc[0]

print(comment)

In [None]:
#from textblob import TextBlob

text_blob = TextBlob('la casa è brutta')
print(text_blob.detect_language())
print(text_blob.tags)
print(text_blob.words)
print(text_blob.sentiment.polarity)

In [None]:
text_blob = TextBlob(comment)
print(text_blob.tags)
print(text_blob.words)
print(text_blob.sentiment.polarity)

In [None]:
list_key_sentiment = []

for i in range (df_reviews_data_cleaned_eng_contains.shape[0]):
    comment = df_reviews_data_cleaned_eng_contains['contextual_phrase'].iloc[i]
    text_blob = TextBlob(comment)
    list_key_sentiment.append((i, text_blob.sentiment.polarity))

In [None]:
print(len(list_key_sentiment))

In [None]:
df_reviews_data_cleaned_eng_contains['neighborhood_sentiment'] = [key_sentiment[1] for key_sentiment in list_key_sentiment]

df_reviews_data_cleaned_eng_contains.head()

In [None]:
#df_reviews_data_cleaned_eng_contains.sort_values(by = 'neighborhood_sentiment')

In [None]:
# Let's peek at some records

#comment = df_reviews_data_cleaned['comments'].iloc[325235]
#print(comment)

In [None]:
df_reviews_listing_sentiment = df_reviews_data_cleaned_eng_contains[['listing_id', 'neighborhood_sentiment']]

df_reviews_listing_sentiment = df_reviews_listing_sentiment.groupby(['listing_id'], as_index = False)['neighborhood_sentiment'].mean()
#df_reviews_listing_sentiment.set_index('listing_id', inplace = True)

df_reviews_listing_sentiment.head()

In [None]:
df_reviews_listing_sentiment.describe()

In [None]:
# Join reviews_listing_sentiment with original dataframe to link sentiment to the neighborhood

df_listings_sentiment = df_listings_data_cleaned.join(df_reviews_listing_sentiment.set_index('listing_id'), on='id')

df_listings_sentiment.head()

In [None]:
df_neighbourhood_sentiment = df_listings_sentiment[['real_neighbourhood', 'neighborhood_sentiment']]

df_neighbourhood_sentiment.head()

In [None]:
df_neighbourhood_sentiment = df_neighbourhood_sentiment.groupby(['real_neighbourhood'], as_index = False)['neighborhood_sentiment'].mean()
df_neighbourhood_sentiment.set_index('real_neighbourhood', inplace = True)
df_neighbourhood_sentiment = df_neighbourhood_sentiment.sort_values(by = ['neighborhood_sentiment'], ascending=False)

df_neighbourhood_sentiment

In [None]:
df_neighbourhood_sentiment.describe()

In [None]:
df_neighbourhood_sentiment.hist(column='neighborhood_sentiment')

In [None]:
neighborhoods = []
sentiments = []

for index, row in df_neighbourhood_sentiment.iterrows():
    neighborhoods.append(index)
    sentiments.append(row['neighborhood_sentiment']) 

In [None]:
plt.figure(figsize = (40, 20)) 
plt.bar(neighborhoods, sentiments)
plt.show()
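
The first two questions ask for the five best and the five worst neighbourhoods. Assuming a dataframe shaped like `df_listings_sentiment`, a sketch of the ranking could aggregate both the mean sentiment and the number of listings behind it, since a mean based on a single listing is not very reliable. The values below are invented, and the minimum of 2 listings is an arbitrary threshold chosen for the example.

```python
import pandas as pd

# Invented values standing in for df_listings_sentiment
df_toy = pd.DataFrame({
    'real_neighbourhood': ['isola', 'isola', 'brera', 'brera', 'duomo', 'loreto', 'loreto'],
    'neighborhood_sentiment': [0.40, 0.20, 0.50, 0.30, 0.90, 0.10, None],
})

# Mean sentiment and number of non-missing values per neighbourhood
summary = (df_toy.groupby('real_neighbourhood')['neighborhood_sentiment']
                 .agg(['mean', 'count']))

# Keep only neighbourhoods backed by at least 2 listings (arbitrary threshold)
reliable = summary[summary['count'] >= 2]

print(reliable.nlargest(5, 'mean'))
print(reliable.nsmallest(5, 'mean'))
```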

It could be useful to compare these results with a sentiment analysis of the **neighborhood_overview** column in the listings dataframe, but too many records are missing:

In [None]:
#df_listings_data_cleaned['neighborhood_overview']

In [None]:
number_of_element = len(df_listings_data_cleaned['neighborhood_overview'])
number_of_nan = df_listings_data_cleaned['neighborhood_overview'].isnull().sum()

print(number_of_element)
print(number_of_nan)
print((number_of_nan / number_of_element) * 100)

Let's join all the **neighborhood_overview** texts into a single string and count the words:

In [None]:
start_time = time()

all_descriptions = ''

for description in df_listings_data_cleaned['neighborhood_overview']:
    if pd.notnull(description):  # skip missing values (NaN is truthy, so a plain truth test would keep it)
        all_descriptions = all_descriptions + ' ' + str(description)

end_time = time()
elapsed_time = end_time - start_time

print('Elapsed time: {} seconds'.format(elapsed_time))
print(len(all_descriptions))

In [None]:
# Remove punctuation

all_descriptions = all_descriptions.translate(str.maketrans('', '', string.punctuation))

print(len(all_descriptions))

In [None]:
# https://www.w3resource.com/python-exercises/string/python-data-type-string-exercise-12.php

def word_count(text):
    counts = dict()
    words = text.split(' ')

    for word in words:
        word = word.lower()
        if word in counts:
            counts[word] = counts[word] + 1
        else:
            counts[word] = 1

    return counts

print(word_count('the quick brown fox jumps over the lazy dog.'))
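
Because `word_count` splits on a single space, a token such as `dog.` keeps its punctuation and repeated spaces produce empty tokens. A compact variant, sketched here with `collections.Counter` and the hypothetical name `word_count_counter`, strips punctuation first and lets `split()` collapse the whitespace.

```python
from collections import Counter
import string

def word_count_counter(text):
    """Count lower-cased words, ignoring punctuation and repeated whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    # split() without arguments collapses runs of whitespace, so no empty tokens appear
    return Counter(cleaned.split())

counts = word_count_counter('The quick brown fox  jumps over the lazy dog.')
print(counts.most_common(3))
```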

In [None]:
dictionary_word_count = word_count(all_descriptions)
print(len(dictionary_word_count))

In [None]:
df_all_descriptions = pd.DataFrame(list(dictionary_word_count.items()) , columns=['Word', 'Count'])
#df_all_descriptions.set_index('Word', inplace = True)
df_all_descriptions = df_all_descriptions.sort_values(by = ['Count'], ascending=False)
df_all_descriptions.head(50)

In [None]:
df_all_descriptions.shape[0]

In [None]:
df_all_descriptions[df_all_descriptions['Count'] == 1].shape[0]

In [None]:
# Drop words with frequency 1

df_all_descriptions = df_all_descriptions[df_all_descriptions['Count'] > 1]

# Looking at the words with the highest frequency, a threshold of 3000 drops the most common words
# like articles and conjunctions

df_all_descriptions = df_all_descriptions[df_all_descriptions['Count'] < 3000]


df_all_descriptions.shape[0]

In [None]:
df_all_descriptions