# Analyzing Covid Effect on AirBnB Listings & Reviews in NYC 

This project analyzes AirBnB data in New York City before and after Covid, trying to find and analyze the effects Covid has had on AirBnB listings and reviews, with emphasis on the listings' names, descriptions and users' reviews. The data for this project comes from the multiple sources listed below.

This analysis is divided into 2 main parts:
* Initial Comparison - A basic comparison of AirBnB data between the Pre-Covid period and the Covid period
* Further Comparison - Mostly based on NLP techniques, this part checks whether listings' names, descriptions and reviews differ between the 2 periods, and tests for correlations between the text and other data.
<br><br>

__**Data Sources:**__
* AirBnB Listing Information from November 2018 - https://github.com/saranggupta94/airbnb
* AirBnB Listing Information from June 2020 - https://github.com/ioslilyng/airbnb_nyc
* AirBnB Listing Information and Reviews from 2021 - http://insideairbnb.com/get-the-data.html
* AirBnB Listing Information From 2019 (Kaggle) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
* AirBnB Listing Information From 2020 (Kaggle) - https://www.kaggle.com/kritikseth/us-airbnb-open-data
* NYC Covid Data - https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3
<br>
* This notebook can also be found in: https://colab.research.google.com/drive/1RKqgln2vMzgmftMydCYC3YEKK_KVML-E?usp=sharing


### Loading Data

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# import os
# os.chdir("/content/drive/MyDrive/Colab Notebooks")
# !pwd

In [None]:
# !pip install plotly
# !pip install seaborn
# !pip install wordcloud
# !pip install nltk
# !pip install textblob
# !pip install spacy
# !pip install gensim
# !pip install pyLDAvis

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import re

import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.graph_objs as go
from typing import List, Tuple
import seaborn as sb
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import CountVectorizer
import spacy
import gensim
import pyLDAvis.gensim_models as gensimvis
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from nltk.stem import PorterStemmer
import numpy as np

import plotly.offline as pyo
pyo.init_notebook_mode()

import pyLDAvis
pyLDAvis.enable_notebook()

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from IPython.display import Markdown, display
def print_bold(string):
    display(Markdown(f'**{string}**'))  # close the bold marker so the text renders bold

In [None]:
def merge_listings_text(df):
    summary = df['summary']
    space = df['space']
    desc = df['description']
    ans = list()
    for c1, c2, c3 in zip(summary, space, desc):
        ans.append(str(c1) + " " + str(c2) + " " + str(c3))
    new_df = df.drop(['summary', 'space'], axis=1)
    new_df['description'] = ans

    return new_df

In [None]:
df_pre = pd.read_csv("data/listings-November-2018.csv")
df_pre = merge_listings_text(df_pre)
df_pre

In [None]:
df_kaggle_pre = pd.read_csv("data/AB_NYC_2019.csv")
df_kaggle_pre

In [None]:
df_kaggle_covid = pd.read_csv("data/AB_US_2020.csv")
df_kaggle_covid = df_kaggle_covid[df_kaggle_covid["city"] == "New York City"]
df_kaggle_covid

In [None]:
df_2020_lst = pd.read_csv("data/listings-June2020.csv")
df_2020_lst = merge_listings_text(df_2020_lst)
df_2020_lst

In [None]:
df_21_march_lst = pd.read_csv("data/listings-march21.csv.gz")
df_21_june_lst = pd.read_csv("data/listings-june-21.csv.gz")
df_21_sep_lst = pd.read_csv("data/listings-september-21.csv.gz")

In [None]:
df_reviews = pd.read_csv("data/reviews-december-2021.csv.gz")
df_reviews["date"] = pd.to_datetime(df_reviews["date"])  # parse dates explicitly; unit-less 'datetime64' casts fail on newer pandas
df_reviews

In [None]:
df_pre_reviews = df_reviews[(df_reviews["date"] < "2020-02-29") & (df_reviews["date"] >= "2018-11-15")]
df_pre_reviews

In [None]:
df_covid_reviews = df_reviews[df_reviews["date"] >= "2020-02-29"]
df_covid_reviews

In [None]:
df_ny_covid_data = pd.read_csv("data/NYC-covid-data-day-by-day.csv")
df_ny_covid_data['DATE_OF_INTEREST'] = pd.to_datetime(df_ny_covid_data['DATE_OF_INTEREST'])
df_ny_covid_data

In [None]:
RENDERER = 'notebook'   # Jupyter notebook
# RENDERER = 'jupyterlab'   # Data spell
# RENDERER = 'colab'      # Colab

# Initial Comparison

Start with initial data exploration and differences between Pre-Covid and Covid Periods

## General Covid Data

In [None]:
def covid_moving_avg(orig_df, title):
    df = orig_df.groupby('DATE_OF_INTEREST').sum().reset_index().loc[:, ['DATE_OF_INTEREST','CASE_COUNT']]
    df["avg"] = df['CASE_COUNT'].rolling(7).mean()  # 7-day moving average (mean of the window, not its sum)
    return px.line(df, x='DATE_OF_INTEREST', y="avg", title=title)

In [None]:
covid_moving_avg(df_ny_covid_data, "7-Day Moving Average Covid New Cases in NYC").show(renderer=RENDERER)

We can see that NYC had 3 waves: the first from mid-March 2020 to July 2020, the second from November 2020 to June 2021, and the third from December 2021 to February 2022.
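
As a small illustration of the smoothing used in the plot above, here is a pandas-free sketch of a trailing 7-day moving average. The daily counts are hypothetical, and the helper name `rolling_mean` is an assumption made for this example. Like pandas' `rolling(7).mean()`, it yields no value until seven observations are available.

```python
from collections import deque


def rolling_mean(values, window=7):
    """Trailing moving average; None until the window is full."""
    buf = deque(maxlen=window)  # keeps only the last `window` values
    out = []
    for value in values:
        buf.append(value)
        out.append(sum(buf) / window if len(buf) == window else None)
    return out


# Hypothetical daily case counts, not taken from the NYC data set.
daily_cases = [7, 14, 21, 28, 35, 42, 49, 56]
smoothed = rolling_mean(daily_cases)
print(smoothed)
```

Dividing by the window length is what distinguishes a moving average from a moving sum; both have the same shape, but only the average stays on the scale of a single day's case count.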

## Prices Comparison

Checking for differences in amount of listings and prices due to Covid.

In [None]:
df_kaggle_pre_neighborhood = df_kaggle_pre[df_kaggle_pre["price"] != 0].groupby("neighbourhood_group")
df_kaggle_covid_neighborhood = df_kaggle_covid[df_kaggle_covid["price"] != 0].groupby("neighbourhood_group")

In [None]:
pre_n_count = df_kaggle_pre_neighborhood.count()['id'].to_frame().reset_index().rename(columns={"id": "2019"}).astype({"neighbourhood_group": 'string'})
covid_n_count = df_kaggle_covid_neighborhood.count()['id'].to_frame().reset_index().rename(columns={"id": "2020"}).astype({"neighbourhood_group": 'string'})
pre_n_max = df_kaggle_pre_neighborhood.max()['price'].to_frame().reset_index().rename(columns={"price": "2019"}).astype({"neighbourhood_group": 'string'})
covid_n_max = df_kaggle_covid_neighborhood.max()['price'].to_frame().reset_index().rename(columns={"price": "2020"}).astype({"neighbourhood_group": 'string'})
pre_n_min = df_kaggle_pre_neighborhood.min()['price'].to_frame().reset_index().rename(columns={"price": "2019"}).astype({"neighbourhood_group": 'string'})
covid_n_min = df_kaggle_covid_neighborhood.min()['price'].to_frame().reset_index().rename(columns={"price": "2020"}).astype({"neighbourhood_group": 'string'})

In [None]:
def comparison_bar_plot(dfs: Tuple, names, x_name = "neighbourhood_group", y_names = ("2019", "2020"), title = None, axes_titles = {"x": None, "y": None},
                        print_diff=False):
    xs = [df[x_name].to_list() for df in dfs]
    ys = [df[y_name].to_list() for df, y_name in zip(dfs, y_names)]
    max_y = max([max(y) for y in ys])
    fig = go.Figure(layout=dict(
        title=title,
        yaxis={"title": axes_titles["y"], "range": [0 - max_y*0.1 if print_diff else 0, max_y * 1.2]},
        xaxis={"title": axes_titles["x"]}
    ))
    for curr_x, curr_y, curr_name in zip(xs, ys, names):
        fig.add_trace(go.Bar(x=curr_x, y=curr_y, text=curr_y, textposition="outside", name=curr_name))
    if print_diff:
        for curr, v1, v2 in zip(xs[0], ys[0], ys[1]):
            diff = abs(v1-v2)
            fig.add_annotation(x=curr, text=f"Difference: {diff}", showarrow=False, yref="paper", yanchor="bottom", y=0, font={"color": "red"})
    return fig


comparison_bar_plot((pre_n_count, covid_n_count), names=["Pre-Covid", "During Covid"], title="Number of Listings Pre and During Covid",
                    axes_titles = {"x": "Neighbourhood Group", "y": "No. of Listings"}, print_diff=True).show(renderer=RENDERER)

comparison_bar_plot((pre_n_max, covid_n_max), y_names = ("2019", "2020"), names=["Pre-Covid", "During Covid"],
                    title="Maximum Listing Price", axes_titles = {"x": "Neighbourhood Group", "y": "Max Listing Price"}, print_diff=True).show(renderer=RENDERER)

comparison_bar_plot((pre_n_min, covid_n_min), y_names = ("2019", "2020"), names=["Pre-Covid", "During Covid"],
                    title="Minimum Listing Price", axes_titles = {"x": "Neighbourhood Group", "y": "Min Listing Price"}, print_diff=True).show(renderer=RENDERER)

In [None]:
def draw_box_price(pre, during):
    z1 = pre.loc[:,["neighbourhood_group", "price", "latitude", "longitude"]]
    z1["year"] = "Pre-Covid"
    z2 = during.loc[:,["neighbourhood_group", "price", "latitude", "longitude"]]
    z2["year"] = "Covid"
    data = pd.concat((z1,z2))
    fig, ax = plt.subplots(figsize=(20, 9))
    order = data["neighbourhood_group"].unique().tolist()
    order.sort()
    ax = sb.boxplot(x="neighbourhood_group", y="price", data=data, showfliers=False, hue="year", ax=ax,order=order)
    ax.set(xlabel = 'Neighbourhood Group', ylabel = 'Price')
    plt.legend(title='Time Period')
    return fig

draw_box_price(df_kaggle_pre, df_kaggle_covid).show()

__**Prices Conclusions**__

**The number of listings in Brooklyn and Manhattan decreased between the pre-Covid period and 2020 (the start and middle of Covid) and increased slightly in the other regions.**

**The maximum asking price in the Bronx and Staten Island changed significantly, while it stayed the same elsewhere. Surprisingly, the minimum asking price increased in these 2 regions (Bronx and Staten Island) and stayed almost the same in the other places.**

**In most neighborhoods the typical (median) prices stayed the same, except for Manhattan, where they dropped significantly. The price range also stayed mostly the same, except for the Bronx.**


---

## Room Types

Checking if there was any difference in the room types offered on AirBnB due to Covid

In [None]:
def draw_rooms_bar(pre, during):
    z1 = pre.groupby(["neighbourhood_group","room_type"]).count()["id"].reset_index().rename(columns={"id": "2019"})
    z2 = during.groupby(["neighbourhood_group","room_type"]).count()["id"].reset_index().rename(columns={"id": "2020"})
    data = pd.merge(z1,z2)
    room_types = data["room_type"].unique().tolist()
    fig = make_subplots(1,3, column_titles=room_types)
    fig.update_layout(title="Amount of Listing per Room Type")
    # neighborhoods = data["neighbourhood_group"].unique().tolist()
    for idx, rt in enumerate(room_types):
        curr = data[data["room_type"]==rt]
        fig.add_trace(go.Bar(x = curr["neighbourhood_group"].tolist(), y=curr["2019"].tolist(), name=f"2019-{rt}", marker_color="blue"), row=1, col=idx+1)
        fig.add_trace(go.Bar(x = curr["neighbourhood_group"].tolist(), y=curr["2020"].tolist(), name=f"2020-{rt}", marker_color="red"), row=1, col=idx+1)
    return fig

draw_rooms_bar(df_kaggle_pre, df_kaggle_covid).show(renderer=RENDERER)

**We can see that there was a decrease in the amount of shared rooms across all neighborhoods, to be expected during a pandemic.**
<br><br><br>
---

## Average Reviews in a Month

From the data we can't really know about the amount of reservations made, but the amount of reviews can be a close enough proxy for this.

Checking the changes in average number of reviews over time.

In [None]:
def draw_avg_reviews_per_month(pre, covid):
    # Select the column before averaging so non-numeric columns are never passed to mean().
    z1 = pre.groupby("neighbourhood_group")["reviews_per_month"].mean().reset_index().rename(columns={"reviews_per_month": "2019"})
    z2 = covid.groupby("neighbourhood_group")["reviews_per_month"].mean().reset_index().rename(columns={"reviews_per_month": "2020"})
    data = pd.merge(z1,z2)
    return px.bar(data, x="neighbourhood_group", y=["2019","2020"], barmode="group", title="Average Number of Reviews in Each Neighborhood")

draw_avg_reviews_per_month(df_kaggle_pre, df_kaggle_covid).show(renderer=RENDERER)

**The average number of reviews has lowered in all the different neighborhoods, as expected when the number of bookings decreased due to Covid-19.**
<br><br><br>

---

In [None]:
def plot_moving_avg(orig_df, col_name, title):
    df = orig_df.groupby('date').count().reset_index().loc[:, ['date', col_name]]
    # Average the daily counts over a trailing 7-row window and plot the smoothed series.
    df['avg'] = df[col_name].rolling(7).mean()
    return px.line(df, x='date', y='avg', title=title, line_shape='spline')

In [None]:
plot_moving_avg(df_pre_reviews, 'comments', "Pre-Covid 7 Day Moving Average Reviews Amount").show(renderer=RENDERER)

In [None]:
plot_moving_avg(df_covid_reviews, 'comments', "Covid 7 Day Moving Average Reviews Amount").show(renderer=RENDERER)

From the review counts we can see that during almost 2 years of Covid there were fewer AirBnB reviews than in a pre-Covid period of the same length.
We can also see that the first wave in NYC caused a big drop in the number of reviews, probably due to a drop in bookings. Later waves didn't affect the number of reviews, which kept growing steadily until the end of the data (except for the last 2 weeks). That final dip may be caused by the third wave or simply by the data ending.
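
Comparing review volumes fairly requires two windows of equal length around the split date. The sketch below shows one way to build them; the split date `2020-02-29` comes from the notebook, while `DATA_END` is only an assumed last review date and both helper names are hypothetical.

```python
from datetime import date

CUTOFF = date(2020, 2, 29)     # split date used in the notebook
DATA_END = date(2021, 12, 31)  # assumption: approximate end of the reviews file


def equal_length_windows(cutoff, data_end):
    """Return (pre_window, covid_window), both spanning data_end - cutoff."""
    span = data_end - cutoff
    return (cutoff - span, cutoff), (cutoff, data_end)


def count_in_window(dates, window):
    """Count dates inside the half-open window [start, end)."""
    start, end = window
    return sum(start <= d < end for d in dates)


pre_window, covid_window = equal_length_windows(CUTOFF, DATA_END)
# Hypothetical review dates, for illustration only.
sample_dates = [date(2019, 7, 1), date(2020, 6, 1), date(2021, 6, 1)]
print(count_in_window(sample_dates, pre_window), count_in_window(sample_dates, covid_window))
```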
<br><br>
---

# Further Comparison

This part includes more complex comparisons, mostly based on NLP techniques applied to the listings' names, descriptions and reviews.

It is separated according to the type of text used for comparison: Listing's Name, Listing's Description and User Reviews.

## Different Functionality

Below is a list of words that are related to Covid that I'll refer to as Covid Related Words. I'll use these words for checking data that is connected to Covid.
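
To make the idea of counting Covid Related Words concrete, here is a minimal sketch that tokenizes a text and reports how many tokens belong to a word set. It uses only a small illustrative subset of the word list and skips the stemming step the notebook applies, so `SAMPLE_COVID_WORDS` and `covid_word_ratio` are assumptions for this example, not the notebook's own helpers.

```python
import re

# Illustrative subset of the Covid Related Words; the full list is defined in the notebook.
SAMPLE_COVID_WORDS = {"covid", "sanitized", "cdc", "quarantine", "disinfected"}


def covid_word_ratio(text, words=SAMPLE_COVID_WORDS):
    """Return (hits, total tokens, hits / total) for a single text."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    hits = sum(token in words for token in tokens)
    ratio = hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), ratio


print(covid_word_ratio("Sanitized per CDC guidelines, near the park"))
```

Without stemming, inflected forms such as "sanitizing" would be missed, which is why the notebook's version stems both the word list and the text.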

In [None]:
COVID_WORDS = list({'covid', 'influenza', 'sanitized', "cdc", "corona", 'coronavirus', 'virus', 'healthcare', 'disinfection', 'disinfected', 'lysol', 'disinfect',
               'sanitizer', 'sanitized', 'sanitize', 'disease', 'prevention', 'quarantine', 'distancing', 'pandemic', 'germ', 'germs', 'germ-fre', 'germ-free',
               'covid-19', 'covid19', 'co-vid', 'corona', 'virus', 'viruses', 'disease', 'bacterial', 'bacteria', 'antibacterial', 'anti-bacterial', 'vaccine',
               'vaccinated', 'vaccinations', 'vaccines', 'epidemic', 'pandemic', 'outbreak', 'isolation'})

In [None]:
from collections import Counter
import re

TO_REMOVE_IRRELEVANT = None

def strip_punc_digit(word):
    return re.sub(r'(<.+>)|[\d]|[^\w\s]', "", word)

def exclude_word(word):
    global TO_REMOVE_IRRELEVANT
    if TO_REMOVE_IRRELEVANT is None:
        irrelevant_words = {"U", "Jr", "W", "E", "S", "N", "Nr", "East", "West", "South", "North", "Est", "Sou",
                            "street", "boulevard", "br", "st", "th", "j", "Ny", "nyc", "new york"}
        to_remove = irrelevant_words.union(STOPWORDS)
        to_remove = to_remove.union(nltk_stopwords.words("spanish"))
        to_remove = to_remove.union(nltk_stopwords.words("french"))
        to_remove = {word.lower() for word in to_remove}
        TO_REMOVE_IRRELEVANT = to_remove

    return word.lower() in TO_REMOVE_IRRELEVANT


def create_word_counts(sentences_lst: List[str], min_count: int):
    all_words = [strip_punc_digit(word)  for name in sentences_lst for word in str(name).split(" ")]
    clean_words = [word.lower().capitalize() for word in all_words if not exclude_word(word) and word != ""]
    counts = Counter(clean_words)
    filtered_counts = {k: v for k,v in dict(counts).items() if v > min_count}
    return filtered_counts


def create_word_cloud(sentences_lst: List[str], min_count: int, **kwargs):
    filtered_counts = create_word_counts(sentences_lst, min_count)
    wordcloud = WordCloud(width = 800, height = 800,
                          background_color ='white',
                          stopwords = set(STOPWORDS),
                          min_font_size = 10,
                          **kwargs)
    return wordcloud.generate_from_frequencies(filtered_counts)


def new_words_after_covid(pre: List[List[str]], covid: List[List[str]], title: str, words_to_show=50):
    pre_words = set()
    for sentences_lst in pre:
        pre_words.update({strip_punc_digit(word.lower()).encode("ascii", "ignore").decode("utf-8") for sentence in sentences_lst
                          for word in str(sentence).split(" ")  if not exclude_word(word)})
    all_covid_words = list()
    for sentences_lst in covid:
        all_covid_words += [strip_punc_digit(word.lower()).encode("ascii", "ignore").decode("utf-8") for sentence in sentences_lst
                            for word in str(sentence).split(" ") if not exclude_word(word)]
    new_words = {word for word in set(all_covid_words) if word not in pre_words}
    counts = Counter(all_covid_words)
    new_words_counts = {word: counts.get(word) for word in new_words}
    new_words_counts = {k:v for k, v in sorted(new_words_counts.items(), key=lambda item: item[1], reverse=True)}
    fig = go.Figure(layout=dict(title=title))
    fig.add_trace(go.Bar(x=list(new_words_counts.keys())[:words_to_show], y=list(new_words_counts.values())[:words_to_show]))
    return fig


def words_change(pre: List[List[str]], covid: List[List[str]], title, sub_titles,
                 words_to_show = 50, min_count = 50):
    pre_conc = [sentence for sentences_lst in pre for sentence in sentences_lst]
    covid_conc = [sentence for sentences_lst in covid for sentence in sentences_lst]
    pre_count = create_word_counts(pre_conc, min_count)
    covid_count = create_word_counts(covid_conc, min_count)
    diffs = {k: (v - covid_count.get(k,0)) for k, v in pre_count.items()}
    diffs = {k: v for k, v in sorted(diffs.items(), key=lambda item: item[1], reverse=True)}
    neg_diffs = dict()
    for word, count in reversed(diffs.items()):
        if count < 0:
            neg_diffs[word] = abs(count)
        else:
            break

    fig = make_subplots(2,1, vertical_spacing=0.4, subplot_titles=sub_titles)
    fig.add_trace(go.Bar(x=list(diffs.keys())[:words_to_show],y=list(diffs.values())[:words_to_show], name="More occurrences Pre-Covid"), row=1, col=1)
    fig.add_trace(go.Bar(x=list(neg_diffs.keys())[:words_to_show],y=list(neg_diffs.values())[:words_to_show], name="Less occurrences Pre-Covid"), row=2, col=1)
    fig.update_layout(title=title, height=600, showlegend=False)
    return fig

In [None]:
def get_sentiment_analysis(data_lst: List[List[str]]):
    ans_info = {"polarity": list(), "subjectivity": list()}
    for sentences_lst in data_lst:
        for sentence in sentences_lst:
            ans = TextBlob(str(sentence))
            ans_info["polarity"].append(ans.sentiment[0])
            ans_info["subjectivity"].append(ans.sentiment[1])
    return ans_info


def sentiment_analysis_diff(pre: List[List[str]], covid: List[List[str]], title):
    pre_info = get_sentiment_analysis(pre)
    covid_info = get_sentiment_analysis(covid)

    fig = go.Figure(layout=dict(title=title))
    fig.add_trace(go.Box(y=pre_info['polarity'], name="Pre-Covid Polarity"))
    fig.add_trace(go.Box(y=covid_info['polarity'], name="Covid Polarity"))
    fig.add_trace(go.Box(y=pre_info['subjectivity'], name="Pre-Covid Subjectivity"))
    fig.add_trace(go.Box(y=covid_info['subjectivity'], name="Covid Subjectivity"))
    return fig

In [None]:
def count_words_in_df(df,  words: List[str], text_col: str):
    ps = PorterStemmer()
    words = [ps.stem(word).lower() for word in words]

    def words_counter(sentence):
        if not isinstance(sentence, str):
            return 0, 0, 0
        count = 0
        split_sentence = sentence.split(" ")
        split_sentence = [ps.stem(word).lower() if isinstance(word, str) else word for word in split_sentence]
        for word in words:
            count += split_sentence.count(word.lower())
        return count, len(split_sentence), count/len(split_sentence)

    count_data = df[text_col].apply(words_counter)
    df['counts'] = [count[0] for count in count_data]
    df['total'] = [count[1] for count in count_data]
    df['ratio'] = [count[2] for count in count_data]
    return df

# get_price_words_corr_df(df_kaggle_pre[["id","price"]][:500], df_pre_reviews[['listing_id','comments']][:500], left_on='id', right_on='listing_id', words=COVID_WORDS, text_col='comments').corr()

In [None]:
def find_text_diff(df_pre, df_covid, left_on, right_on, diff_col, relevant_words: List[str]):
    ps = PorterStemmer()
    relevant_words = [ps.stem(word).lower() for word in relevant_words] if relevant_words is not None else None

    def find_diff(data):
        ans = list()
        not_related_ans = list()
        old = data[0]
        new = data[1]
        if not isinstance(old, str) or not isinstance(new, str) or old == new:
            return ans, not_related_ans
        old = [ps.stem(word).lower() for word in old.split(" ")]
        new = [ps.stem(word).lower() for word in new.split(" ")]
        for word in new:
            word = strip_punc_digit(word)
            if word not in old:
                if (relevant_words is not None and word in relevant_words) or (relevant_words is None):
                    ans.append(word)
                else:
                    not_related_ans.append(word)
        return ans, not_related_ans

    df = df_pre if df_covid is None else pd.merge(df_pre, df_covid, left_on=left_on, right_on=right_on, suffixes=("_old","_new"))
    olds = df[f'{diff_col}_old']
    news = df[f'{diff_col}_new']
    diffs = list(map(find_diff, zip(olds, news)))
    return diffs, df

def calc_diff_amounts(df_pre, df_covid, left_on, right_on, diff_cols: List[str], relevant_words: List[str]):
    ans = list()
    total_changes = list()
    df = pd.merge(df_pre, df_covid, left_on=left_on, right_on=right_on, suffixes=("_old","_new"))
    for curr_col in  diff_cols:
        diffs, df = find_text_diff(df, None, left_on, right_on, curr_col, relevant_words)
        diff_count = sum([1 if len(curr[0]) != 0 else 0 for curr in diffs])
        word_counts = {word: sum([curr[0].count(word) for curr in diffs]) for word in relevant_words}
        word_counts = {word: count for word, count in word_counts.items() if count > 0}
        total_changes.append(sum([1 if len(curr[0]) != 0  or len(curr[1]) != 0 else 0 for curr in diffs]))
        ans.append((word_counts, diff_count, len(df)))

    return ans, total_changes

In [None]:
def clean_and_flatten_corpus(data: List[List[str]]):
    corpus = [sentence for part in data for sentence in part]
    lem_corpus = list()
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for sentence in corpus:
        if sentence == "" or not isinstance(sentence, str):
            continue
        doc = nlp(sentence)
        lem_words = [token.lemma_ for token in doc]
        new_sentence = " ".join(lem_words).encode("ascii", "ignore").decode("utf-8").strip()
        if len(new_sentence) == 0:
            continue
        lem_corpus.append(new_sentence)
    return lem_corpus

## Names

Exploring data about Listing's names



### Words in Names Pre & During Covid

In [None]:
def create_name_word_cloud(df, title, min_count = 500):
    names = df["name"].tolist()
    plt.figure(figsize = (15, 15), facecolor = None)
    plt.title(title)
    plt.imshow(create_word_cloud(names, min_count))

**Word Cloud of Words in Names Pre-Covid**

In [None]:
create_name_word_cloud(df_pre, "Words in Listings Names Before Covid")

**Word Cloud of Words in Names During Covid**

In [None]:
create_name_word_cloud(df_2020_lst,"Words in Listings Names After Covid")

### New Words Added to Names During Covid

Checking for new words in the listing's names that weren't used before Covid.
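
The underlying idea is a vocabulary set difference: collect every word seen before Covid, then count only those Covid-period words that are missing from it. The sketch below shows this on hypothetical listing names; `tokenize` and `new_word_counts` are names invented for the example and omit the punctuation and stopword cleaning done in the notebook.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenization (a simplification)."""
    return str(text).lower().split()


def new_word_counts(pre_texts, covid_texts, top_n=3):
    """Most frequent Covid-period words that never occur in the pre-Covid texts."""
    pre_vocab = {word for text in pre_texts for word in tokenize(text)}
    covid_counts = Counter(word for text in covid_texts for word in tokenize(text))
    new_only = Counter({w: c for w, c in covid_counts.items() if w not in pre_vocab})
    return new_only.most_common(top_n)


# Hypothetical listing names, for illustration only.
pre_names = ["Cozy studio near park", "Sunny loft"]
covid_names = ["Sanitized cozy studio", "Sanitized loft with WFH desk"]
top_new = new_word_counts(pre_names, covid_names)
print(top_new)
```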

In [None]:
new_words_after_covid([df_pre["name"].tolist(), df_kaggle_pre["name"].tolist()],
                      [df_2020_lst["name"].tolist(), df_21_march_lst["name"].tolist(), df_21_june_lst["name"].tolist()],
                      "Top New Words in Listing's Names During Covid").show(renderer=RENDERER)

From this graph we can see that 3 main groups of new words were added to the listings names:
* Companies - companies that started listing in AirBNB during Covid, or maybe even new companies created during Covid, like: Alohause, Staypineapple, Incentra and more.
* Words directly related to Covid - like: CDC, sanitized, disinfected and more.
* Words Relevant to the Life Covid Created - like: WFH, Peloton (home exercise equipment company), cubicle and more.

### Word Occurrences Changes Pre & During Covid

Trying to see biggest change in words occurrences before and during Covid.
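
One simple way to express such a change is the per-word difference between two frequency tables. The following sketch is an assumption-laden simplification (hypothetical texts, no stopword filtering or minimum count): positive values mean the word was more common before Covid, negative values mean it became more common during Covid.

```python
from collections import Counter


def occurrence_shift(pre_texts, covid_texts):
    """Map each word to (pre-Covid count - Covid count)."""
    pre = Counter(w for text in pre_texts for w in text.lower().split())
    covid = Counter(w for text in covid_texts for w in text.lower().split())
    vocab = set(pre) | set(covid)  # include words that appear in only one period
    return {word: pre[word] - covid[word] for word in vocab}


# Hypothetical texts, for illustration only.
shift = occurrence_shift(["gym gym jfk"], ["gym park park"])
print(sorted(shift.items(), key=lambda item: item[1], reverse=True))
```

Taking the union of both vocabularies keeps words that appear only during Covid in the comparison, so they show up with negative values instead of being dropped.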

In [None]:
words_change([df_pre["name"].tolist()], [df_2020_lst["name"].tolist()],
             "Top Word Occurrence Differences Between Pre-Covid and During Covid",
             ["Words Occurring More Often Pre-Covid", "Words Occurring More Often During Covid"]).show(renderer=RENDERER)

**We see changes in the usage of words that are directly connected to Covid restrictions, like: Gym (less relevant due to restrictions), Park (more relevant when other places are closed), JFK & LGA (airports, less relevant when there are fewer flights).**
<br><br><br>
---

### Sentiment Change in Listing's Names Pre & During Covid

Trying to see if there is any difference in Sentiment (Subjectivity & Polarity) in Listing's names due to Covid.

Sentiment Analysis in this project is done using TextBlob library (https://textblob.readthedocs.io/en/dev/)

In [None]:
sentiment_analysis_diff([df_pre["name"].tolist(), df_kaggle_pre["name"].tolist()],
                        [df_2020_lst["name"].tolist(), df_21_march_lst["name"].tolist(), df_21_june_lst["name"].tolist(), df_21_sep_lst["name"].tolist()],
                        "Subjectivity and Polarity of Listings Names Before and During Covid").show(renderer=RENDERER)

**There is almost no difference in the polarity and subjectivity of names due to Covid.**
<br><br><br>
---

## Descriptions

Exploring data about Listing's Descriptions

### Words in Descriptions Pre & During Covid

In [None]:
def create_description_word_cloud(df, title, min_count = 850):
    names = df["description"].tolist()
    plt.figure(figsize = (15, 15), facecolor = None)
    plt.title(title)
    plt.imshow(create_word_cloud(names, min_count))

**Word Cloud of Words in Descriptions Pre-Covid**

In [None]:
create_description_word_cloud(df_pre, "Words in Listings Descriptions Before Covid")

**Word Cloud of Words in Description During Covid**

In [None]:
create_description_word_cloud(df_2020_lst, "Words in Listings Descriptions After Covid")

### New Words Added to Descriptions During Covid

Checking for new words in the listing's descriptions that weren't used before Covid.

In [None]:
new_words_after_covid([df_pre["description"].tolist()],
                      [df_2020_lst["description"].tolist()],
                      "Top New Words in Listing's Description During Covid").show(renderer=RENDERER)

**As in the names, we see the names of new companies and words relevant to Covid (CDC, Covid, virus, etc.).
There are also new words regarding Covid life, like: Chromecast, Liveready (Sound systems seller).**
<br><br><br>
---

### Word Occurrences Changes Pre & During Covid

In [None]:
words_change([df_pre["description"].tolist()], [df_2020_lst["description"].tolist()],
             "Top Word Occurrence Differences Between Pre-Covid and During Covid in Descriptions",
             ["Words Occurring More Often Pre-Covid", "Words Occurring More Often During Covid"]).show(renderer=RENDERER)

### Sentiment Change in Listing's Descriptions Pre & During Covid

Trying to see if there is any difference in Sentiment (Subjectivity & Polarity) in Listing's descriptions due to Covid.

In [None]:
sentiment_analysis_diff([df_pre["description"].tolist()],
                        [df_2020_lst["description"].tolist(), df_21_march_lst["description"].tolist(),
                         df_21_june_lst["description"].tolist(), df_21_sep_lst["description"].tolist()],
                        "Subjectivity and Polarity of Listings Descriptions Before and During Covid").show(renderer=RENDERER)

**As with the names there is almost no difference in polarity and subjectivity in the descriptions**
<br><br><br>
---

## Name & Descriptions

Exploring difference in both names and descriptions due to Covid.

### Covid Words Changes in Names & Descriptions

Exploring how many names and descriptions have changed to include Covid Related Words
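
The core check can be sketched as a set operation: a listing counts as a Covid-related change when a Covid word appears in the new text but not in the old one. The helper below is a simplified illustration (whitespace tokens, no stemming or punctuation stripping), and its name `added_covid_words` is an assumption for this example.

```python
def added_covid_words(old_text, new_text, covid_words):
    """Covid words present in the new text but absent from the old one."""
    old_tokens = set(str(old_text).lower().split())
    new_tokens = set(str(new_text).lower().split())
    return sorted((new_tokens - old_tokens) & set(covid_words))


# Hypothetical listing names and a small illustrative word set.
words = {"sanitized", "covid"}
print(added_covid_words("Cozy studio", "Cozy sanitized studio", words))
print(added_covid_words("Sanitized flat", "Sanitized flat", words))
```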

In [None]:
def names_descriptions_covid_words():
    diffs_name, name_changes = calc_diff_amounts(df_kaggle_pre, df_kaggle_covid, left_on='id', right_on='id', diff_cols=['name'], relevant_words=COVID_WORDS)
    diffs_name = diffs_name[0]
    name_changes = name_changes[0]
    diffs_description, desc_changes = calc_diff_amounts(df_pre, df_2020_lst, left_on='id', right_on='id', diff_cols=['description'], relevant_words=COVID_WORDS)
    diffs_description = diffs_description[0]
    desc_changes = desc_changes[0]
    print('\n\n')
    print_bold(f"Listing's names found before and after covid: {diffs_name[2]}. Listing's names changed: {name_changes}."
          f" Listing's names change with Covid words: {diffs_name[1]}")
    print('\n\n')
    print_bold(f"Listing's descriptions found before and after covid: {diffs_description[2]}. Listing's descriptions changed: {desc_changes}."
          f" Listing's descriptions change with Covid words: {diffs_description[1]}")
    print('\n\n\n')
    # Words are plotted on the x axis and their change counts on the y axis.
    fig = go.Figure(layout=dict(title="Amount of Covid Related Words Changes From Pre-covid Listing's Names & Descriptions", xaxis_title="Covid Related Word", yaxis_title='Amount of Changes Including the Word'))
    fig.add_trace(go.Bar(x=list(diffs_name[0].keys()), y=list(diffs_name[0].values()), name="Name Changes"))
    fig.add_trace(go.Bar(x=list(diffs_description[0].keys()), y=list(diffs_description[0].values()), name="Description Changes"))
    fig.show(renderer=RENDERER)

names_descriptions_covid_words()

**Surprisingly only a small portion of the listings before and after covid have changed their name and description to include Covid related words, although a significant amount of names and descriptions were changed.**
<br><br><br>

---

### Check for Price Changes in Listings with Names & Descriptions Changes

Trying to find if there is any difference in price changes in listings that have also changed their description or name during covid and listings that haven't changed.
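
Listing prices in these files are stored as strings such as `$1,250.00`, so they have to be parsed before differences can be averaged. The sketch below shows that step on hypothetical (old, new) price pairs; `price_to_float` and `mean_price_change` are illustrative names, and the string format is assumed from the cleaning regex used in the notebook.

```python
import re
from statistics import mean


def price_to_float(price):
    """Strip the dollar sign and thousands separators, then convert to float."""
    return float(re.sub(r"[,$]", "", price))


def mean_price_change(price_pairs):
    """Mean of (new - old) over (old, new) price-string pairs."""
    return mean(price_to_float(new) - price_to_float(old) for old, new in price_pairs)


# Hypothetical price pairs, for illustration only.
pairs = [("$100.00", "$90.00"), ("$1,000.00", "$1,030.00")]
print(mean_price_change(pairs))
```

Because a plain mean is sensitive to a few very expensive listings, a median or a relative (percentage) change could be a useful cross-check on the same groups.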

In [None]:
def merge_split_by_changes(df_pre, df_covid, relevant_words):
    df = pd.merge(df_pre, df_covid, on='id',suffixes=("_old","_new"))
    ps = PorterStemmer()
    relevant_words = [ps.stem(word).lower() for word in relevant_words] if relevant_words is not None else None

    def helper(data):
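        # Lower-case everything (str() also guards against NaN values) and flag a listing
        # when a Covid word is absent from the old text but present in the new one.
        # Matching is by substring on the stemmed word, as in the original check.
        old_name, new_name, old_desc, new_desc = (str(text).lower() for text in data)
        for word in relevant_words:
            if (word not in old_name and word in new_name) or (word not in old_desc and word in new_desc):
                return True
        return False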

    changed_cond = (df['name_old'] == df['name_new']) & (df['description_old'] == df['description_new'])
    df_same = df[changed_cond]
    df_diff = df[~changed_cond]
    old_name = df_diff['name_old']
    new_name = df_diff['name_new']
    old_description = df_diff['description_old']
    new_description = df_diff['description_new']
    covid_change = list(map(helper,zip(old_name, new_name, old_description, new_description)))
    covid_change_df = df_diff[covid_change]
    changed_df = df_diff[[not x for x in covid_change]]

    return df_same, covid_change_df, changed_df


def get_price_diffs(df):
    price_to_float = lambda price: float(re.sub("[,\\$]","", price))
    old = df['price_old'].apply(price_to_float)
    new = df['price_new'].apply(price_to_float)
    return new - old


def create_price_diffs_fig(same_df, covid_change_df, non_covid_change_df, title):
    same_changes = get_price_diffs(same_df)
    covid_changes = get_price_diffs(covid_change_df)
    non_covid_changes = get_price_diffs(non_covid_change_df)
    changes_arr = [same_changes, covid_changes, non_covid_changes]
    names = [name for name, x in zip(["No Change", 'Covid Word Change', 'Non Covid Word Change'], changes_arr) if len(x) > 0]
    data = [np.mean(x) for x in changes_arr if len(x) > 0]
    fig = go.Figure(layout=dict(title=title, xaxis_title='Type of Change', yaxis_title='Mean Price Change'))
    fig.add_trace(go.Bar(x=names, y=data))

    return fig

In [None]:
create_price_diffs_fig(*merge_split_by_changes(df_pre[['id', 'name', 'description','price']], df_2020_lst[['id', 'name', 'description','price']], COVID_WORDS),
                       "Mean Prices Change From Pre-Covid to Start of Covid (June-2020)").show(renderer=RENDERER)

**We can see that listings that didn't change their name or description during Covid have a slightly increased price, while listings that changed their name or description have lowered their price. Listings whose changes involve Covid-related words show a much bigger price change than listings whose changes don't include Covid words.**
<br><br><br>

---

In [None]:
create_price_diffs_fig(*merge_split_by_changes(df_pre[['id', 'name', 'description','price']], df_21_sep_lst[['id', 'name', 'description','price']], COVID_WORDS),
                       "Mean Prices Change From Pre-Covid to Latest Time (September 2021) (Covid is around for a while)").show(renderer=RENDERER)

**The data doesn't contain a listing that didn't change its price from pre-Covid to September 2021 (to be expected with inflation).
Interestingly, we can see that all listings have increased their price, and listings that have Covid-related words have increased their price more than listings that changed without using Covid words.**
<br><br><br>

---

### Correlation Between Covid Words and Price

Checking for a correlation between the number of Covid word occurrences in listing names and descriptions and the listing price.
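
For readers who want to see what such a check computes, here is a minimal sketch that counts keyword hits per text and computes the Pearson coefficient by hand. The helper names `count_keyword_hits` and `pearson`, the exact-match tokenization, and the toy data are assumptions made for this illustration; the notebook's own `count_words_in_df` helper may count words differently.

```python
import math
import re


def count_keyword_hits(text: str, keywords: set) -> int:
    """Count tokens in `text` that exactly match a keyword (case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in keywords for token in tokens)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy listings (not taken from the real data).
toy_keywords = {"covid", "sanitized", "quarantine"}
toy_names = [
    "Sanitized loft, covid safe",
    "Sunny room",
    "Quarantine friendly studio",
    "Big house",
]
toy_prices = [120.0, 90.0, 100.0, 300.0]

hits = [count_keyword_hits(name, toy_keywords) for name in toy_names]
r = pearson(hits, toy_prices)
print(hits, round(r, 3))
```

A coefficient near 0 indicates no linear relationship, which does not by itself rule out a non-linear one.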

In [None]:
def corr_covid_words_price_names_desc():
    fix_price = lambda pr: float(re.sub("[,\\$]","", pr))
    df = count_words_in_df(df_2020_lst, COVID_WORDS, 'name')[['counts','total','ratio','price']]
    df['price'] = df['price'].apply(fix_price)
    name_df = df.corr().loc[['counts','total', 'ratio'], 'price']
    name_df = name_df.rename({"counts": 'Covid Words Used in Name', 'total': 'Name Length', 'ratio': 'Ratio Between Covid Words to Total Name Length'})

    desc_df = count_words_in_df(df_2020_lst, COVID_WORDS, 'description')[['counts','total','ratio','price']]
    desc_df['price'] = desc_df['price'].apply(fix_price)
    desc_df = desc_df.corr().loc[['counts','total', 'ratio'], 'price']
    desc_df = desc_df.rename({"counts": 'Covid Words Used in Description', 'total': 'Description Length', 'ratio': 'Ratio Between Covid Words to Description Length'})

    final_df = pd.DataFrame(pd.concat((name_df, desc_df)))
    fig = go.Figure(data=go.Table(header=dict(values = ['','price'],
                                              fill_color='paleturquoise',
                                              font_size=20
                                              ),
                                  cells=dict(
                                      values=[[x for x in final_df.index],[f'{x[0]:.3f}' for x in final_df.values]],
                                      fill_color='lavender'
                                  )),
                    layout=dict(title='Pearson Correlation Between Covid Words Occurrences in Names & Descriptions to Listing Price'))

    return fig


In [None]:
corr_covid_words_price_names_desc().show(renderer=RENDERER)

**There seems to be no correlation between the number of Covid words in a listing's name or description and its price during Covid.**
<br><br><br>

---

## Reviews

Exploring reviews data and differences due to Covid.

### Words in Reviews Pre & During Covid

In [None]:
def create_reviews_word_cloud(df, title, min_count = 2500):
    names = df["comments"].tolist()
    plt.figure(figsize = (15, 15), facecolor = None)
    plt.title(title)
    plt.imshow(create_word_cloud(names, min_count))

In [None]:
create_reviews_word_cloud(df_pre_reviews, "Pre-covid Reviews Word Cloud")

In [None]:
create_reviews_word_cloud(df_covid_reviews, "Covid Reviews Word Cloud")

### New Words Used in Review During Covid

In [None]:
new_words_after_covid([df_pre_reviews['comments'].tolist()],
                      [df_covid_reviews['comments'].tolist()],
                      "Top New Words in Reviews During Covid").show(renderer=RENDERER)

### Word Occurrences Changes Pre & During Covid

In [None]:
words_change([df_pre_reviews["comments"].tolist()], [df_covid_reviews["comments"].tolist()],
             "Top Words Occurrences Difference Pre-covid than During Covid in Reviews",
             ["Words Occurrences Difference Higher Pre-Covid ", "Words Occurrences Difference Lower During Covid"]).show(renderer=RENDERER)

**Just like in the names and descriptions, we can see that most of the new words used are Covid-related words (Covid, pandemic, quarantine, WFH and more).**
<br><br><br>

---
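
The two comparisons in this part (words that only appear during Covid, and shifts in how often a word is used) can be sketched without the notebook's helpers as follows. The function names, the simple regex tokenizer, and the normalization by corpus size are assumptions for this illustration; `new_words_after_covid` and `words_change` may be implemented differently.

```python
import re
from collections import Counter


def word_counts(texts):
    """Count lower-cased word tokens over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def new_words(pre_texts, covid_texts, top_n=10):
    """Words seen in the later corpus but never in the earlier one, most frequent first."""
    pre, covid = word_counts(pre_texts), word_counts(covid_texts)
    fresh = Counter({word: n for word, n in covid.items() if word not in pre})
    return fresh.most_common(top_n)


def relative_frequency_change(pre_texts, covid_texts):
    """Per-word change in share of all tokens (later share minus earlier share)."""
    pre, covid = word_counts(pre_texts), word_counts(covid_texts)
    pre_total = sum(pre.values()) or 1      # avoid dividing by zero on empty input
    covid_total = sum(covid.values()) or 1
    vocabulary = pre.keys() | covid.keys()
    # Counter returns 0 for words it has not seen.
    return {w: covid[w] / covid_total - pre[w] / pre_total for w in vocabulary}


# Hypothetical toy reviews (not taken from the real data).
toy_pre = ["Great location and great host", "Clean room"]
toy_covid = ["Great place to quarantine",
             "Very clean, perfect for quarantine during the pandemic"]

top_new = new_words(toy_pre, toy_covid, top_n=1)
change = relative_frequency_change(toy_pre, toy_covid)
print(top_new, change["great"] < 0)
```

Comparing shares rather than raw counts keeps the result meaningful when the two periods contain very different numbers of reviews.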

### Sentiment Change in Reviews Pre & During Covid

In [None]:
sentiment_analysis_diff([df_pre_reviews["comments"].tolist()], [df_covid_reviews["comments"].tolist()],
                        "Sentiment Analysis for Pre and During Covid Reviews").show(renderer=RENDERER)

**Interestingly, we can see that both subjectivity and polarity values have increased a bit during Covid, meaning the reviews are more subjective and more positive.**
<br><br><br>

---

### Correlation Between Reviews and Ratings

#### Correlation Between Covid Words in Reviews and Ratings

In [None]:
def corr_covid_words_review(df_lst, df_review):
    df = pd.merge(df_lst, df_review, left_on='id', right_on='listing_id')[['comments','review_scores_rating']]
    counts_df = count_words_in_df(df, COVID_WORDS, 'comments')[['counts','total','ratio','review_scores_rating']]
    counts_df = pd.DataFrame(counts_df.corr().loc[['counts', 'total', 'ratio'], 'review_scores_rating'])
    counts_df = counts_df.rename({"counts": 'Covid Words Used in Comment', 'total': 'Comment Length', 'ratio': 'Ratio Between Covid Words to Comment Length',
                                  'review_scores_rating': 'Rating'})
    fig = go.Figure(data=go.Table(header=dict(values = ['','Rating'],
                                                  fill_color='paleturquoise',
                                                  font_size=20
                                                  ),
                                      cells=dict(
                                          values=[[x for x in counts_df.index],[f'{x[0]:.3f}' for x in counts_df.values]],
                                          fill_color='lavender'
                                      )),
                        layout=dict(title="Pearson Correlation Between Covid Words Occurrences in Reviews to Listing's Rating"))

    return fig

In [None]:
corr_covid_words_review(df_2020_lst,df_covid_reviews).show(renderer=RENDERER)

**As with names and descriptions, there is no correlation between the number of Covid words in a review and a listing's rating.**
<br><br><br>

---

#### Correlation Between Review Sentiment and Rating

In [None]:
def corr_sentiment_review_rating(df_lst, df_review, title):
    def convert_sentiment_to_cat(pol):
        if pol < -0.33:
            return -1
        elif pol < 0.33:
            return 0
        return 1

    df = pd.merge(df_lst, df_review, left_on='id', right_on='listing_id')[['id_x', 'comments', 'review_scores_rating']]
    sentiments_lst = get_sentiment_analysis([df.loc[:,'comments'].tolist()])
    df['Polarity'] = sentiments_lst['polarity']
    df['Subjectivity'] = sentiments_lst['subjectivity']
    # numeric_only=True skips the text 'comments' column, which cannot be averaged
    sentiment_df = df.groupby('id_x').mean(numeric_only=True).reset_index()
    sentiment_df = sentiment_df.drop('id_x', axis=1)
    sentiment_df['Polarity'] = sentiment_df['Polarity'].apply(convert_sentiment_to_cat)
    sentiment_df['Subjectivity'] = sentiment_df['Subjectivity'].apply(convert_sentiment_to_cat)

    sentiment_df = sentiment_df.corr().loc[['Polarity', 'Subjectivity'], 'review_scores_rating']
    fig = go.Figure(data=go.Table(header=dict(values = ['','Rating'],
                                              fill_color='paleturquoise',
                                              font_size=20
                                              ),
                                  cells=dict(
                                      values=[[x for x in sentiment_df.index],[f'{x:.3f}' for x in sentiment_df.values]],
                                      fill_color='lavender'
                                  )),
                    layout=dict(title=title))

    return fig

In [None]:
corr_sentiment_review_rating(df_pre,df_pre_reviews,
                             'Pearson Correlation Between Mean Polarity & Subjectivity of Reviews to Listing Rating Before Covid').show(renderer=RENDERER)

In [None]:
corr_sentiment_review_rating(df_2020_lst, df_covid_reviews,
                             'Pearson Correlation Between Mean Polarity & Subjectivity of Reviews to Listing Rating at Beginning of Covid').show(renderer=RENDERER)

In [None]:
corr_sentiment_review_rating(df_21_sep_lst, df_covid_reviews,
                             'Pearson Correlation Between Mean Polarity & Subjectivity of Reviews to Listing Rating During Covid').show(renderer=RENDERER)

**There is very little correlation between the mean polarity & subjectivity of reviews and a listing's rating, both before and during Covid (at the beginning and in a later period).**
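
The aggregation behind these tables (average the review scores per listing, then bucket the average into three categories at ±0.33) can be sketched in plain Python as below. The names `to_category` and `mean_score_per_listing` and the toy scores are invented for this illustration.

```python
from collections import defaultdict
from statistics import mean


def to_category(score: float, threshold: float = 0.33) -> int:
    """Map a score to -1 (below -threshold), 0 (in between) or 1 (at or above threshold)."""
    if score < -threshold:
        return -1
    if score < threshold:
        return 0
    return 1


def mean_score_per_listing(reviews):
    """Average a per-review score for each listing id.

    `reviews` is an iterable of (listing_id, score) pairs.
    """
    grouped = defaultdict(list)
    for listing_id, score in reviews:
        grouped[listing_id].append(score)
    return {listing_id: mean(scores) for listing_id, scores in grouped.items()}


# Hypothetical (listing_id, polarity) pairs, not taken from the real data.
toy_reviews = [(1, 0.8), (1, 0.4), (2, 0.1), (2, -0.1), (3, -0.5)]

categories = {listing_id: to_category(avg)
              for listing_id, avg in mean_score_per_listing(toy_reviews).items()}
print(categories)
```

One thing to keep in mind when reading the tables: TextBlob reports subjectivity in the range 0 to 1, so with these thresholds the -1 bucket can only ever be reached by polarity.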

### Reviews Ratings

In [None]:
def draw_box_compare(data, names):
    fig = go.Figure()
    for curr, name in zip(data, names):
        # Plot the current series only, not the whole list of series
        fig.add_trace(go.Box(x=curr, name=name))
    return fig

draw_box_compare([df_pre['review_scores_rating'].tolist(),df_2020_lst['review_scores_rating'].tolist(), df_21_sep_lst['review_scores_rating'].tolist()],
                 ['Pre-covid','Start of Covid (June 2020)', 'During Covid (September 2021)']).show(renderer=RENDERER)

**Although the number of reviews changed drastically across these 3 periods, the listing score distributions stayed the same. This shows that people's scoring didn't change due to Covid, although the text of their reviews, including its subjectivity and polarity, did.**
<br>

---
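
One possible way to check numerically that two rating distributions look alike, in addition to reading the box plots, is to compare their five-number summaries. The sketch below uses only the standard library; `five_number_summary` and the toy ratings are hypothetical and not part of the notebook.

```python
import math
from statistics import median, quantiles


def five_number_summary(values):
    """Return min, quartiles and max of `values`, ignoring None and NaN entries.

    Needs at least two valid values, since statistics.quantiles requires that.
    """
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    q1, _, q3 = quantiles(clean, n=4)
    return {"min": clean[0], "q1": q1, "median": median(clean),
            "q3": q3, "max": clean[-1]}


# Hypothetical ratings for two periods (not taken from the real data).
toy_pre = [4.5, 4.8, float("nan"), 5.0, 4.9, 4.2]
toy_covid = [4.6, 4.7, 5.0, 4.9, float("nan"), 4.1]

for label, ratings in [("pre", toy_pre), ("covid", toy_covid)]:
    print(label, five_number_summary(ratings))
```

Missing ratings are dropped before summarizing, since listings without reviews have no score.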

### Find Patterns Using Word Embeddings on Reviews

Here I try to find interesting patterns in the reviews using word embeddings. I use a Word2Vec model to create embeddings from scratch from all the review data written during Covid.
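
Word similarity in an embedding model such as Word2Vec is usually measured with cosine similarity between word vectors, and gensim's `most_similar` ranks neighbours by it. The sketch below shows the idea on tiny hand-written vectors; the 3-dimensional vectors and the `most_similar` function defined here are illustrative stand-ins, not output of a trained model or gensim's implementation.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional "embeddings" for illustration only.
toy_vectors = {
    "covid": [0.9, 0.1, 0.0],
    "pandemic": [0.8, 0.2, 0.1],
    "clean": [0.1, 0.9, 0.2],
    "host": [0.0, 0.2, 0.9],
}

neighbours = most_similar("covid", toy_vectors, topn=2)
print([word for word, _ in neighbours])
```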

In [None]:
def tokenize_a_clean_corpus(corpus):
    tokenized_corpus = list()
    for sentence in corpus:
        clean_tokens = list()
        for word in sentence.split(" "):
            if re.match(r"[^\w\s]|[(\d)+]$",word) is not None or exclude_word(word):
                continue
            new_word = strip_punc_digit(word)
            if len(new_word) > 0:
                clean_tokens.append(new_word.lower())
        if len(clean_tokens) != 0:
            tokenized_corpus.append(clean_tokens)

    return tokenized_corpus


def train_word2vec_model(data: List[List[str]], embedding_size=100, window_size=7, min_word_count=5):
    corpus = clean_and_flatten_corpus(data)
    tokenized_corpus = tokenize_a_clean_corpus(corpus)
    model = Word2Vec(sentences=tokenized_corpus, vector_size=embedding_size, window=window_size, min_count=min_word_count, workers=4)
    return model

def get_similarity_word_cloud(source_words: List[str], model, num_words):
    similarity_factor = 100
    most_similar_words = list()
    for word in source_words:
        try:
            similar_words = model.wv.most_similar(word, topn=num_words)
            most_similar_words.append((word, similarity_factor*2))
            most_similar_words += [(curr[0], int(curr[1]*similarity_factor)) for curr in similar_words]
        except KeyError:
            pass

    most_similar_words = sorted(most_similar_words, key=lambda t: t[1])
    freq_quantile = [most_similar_words[len(most_similar_words)//3][1], most_similar_words[(len(most_similar_words)*2)//3][1]]
    words_dict = {word: val for word, val in most_similar_words}

    def coloring_func(word, *args, **kwargs):
        colors = [(17,225,0),(200,255,104), (255,255,104), (255,247,10)]
        freq = words_dict.get(word.lower(),0)
        idx=3
        # Check the upper tertile first; otherwise the middle bucket is unreachable
        if freq > freq_quantile[1]:
            idx = 1
        elif freq > freq_quantile[0]:
            idx = 2
        return  colors[0] if word.lower() in source_words else colors[idx]

    return create_word_cloud(words_dict, min_count=0, color_func=coloring_func)


def get_most_similar_words(words: List[str], model, num_words: int=10):
    most_similar_words = list()
    for word in words:
        try:
            similar_words = model.wv.most_similar(word, topn=num_words)
            most_similar_words += [x[0] for x in similar_words]
        except KeyError:
            pass

    return most_similar_words + words


def create_tsne_fig_of_words(words: List[str], model, title, color_values=None, perplexity=40, r_state=None):

    embeddings = list()
    used_words = list()
    for word in words:
        try:
            embeddings.append(model.wv[word])
            used_words.append(word)
        except KeyError:
            pass

    new_embeddings = TSNE(init='pca', perplexity=perplexity, n_iter=3000, random_state=r_state).fit_transform(embeddings)
    df = pd.DataFrame({"word": used_words,
                       "tsne1": [embed[0] for embed in new_embeddings],
                       "tsne2": [embed[1] for embed in new_embeddings]})
    color_map = None
    if color_values is not None:
        df['Type'] = df['word'].apply(color_values)
        color_map = {color: color for color in df['Type']}

    fig = px.scatter(df, x='tsne1', y='tsne2', text='word',
                     color='Type' if color_values is not None else None,
                     color_discrete_map=color_map, title=title)
    fig.update_traces(textposition='top center')
    return fig


def paint_similarity_word_cloud(source_words: List[str], model, title, num_words=50):
    wordcloud = get_similarity_word_cloud(source_words, model, num_words)
    plt.figure(figsize = (10, 10), facecolor = None)
    plt.title(title)
    plt.imshow(wordcloud)

In [None]:
covid_embeddings_model = train_word2vec_model([df_covid_reviews["comments"].tolist()], min_word_count=1)

#### Covid Related Words by Embedding Similarity Word Cloud

Here is a word cloud of the Covid words and their most similar words according to the Word2Vec model.
The words are colored by similarity: green for the most similar, yellow for the least similar.
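
The coloring rule described here amounts to splitting the similarity scores into three bands by tertile. A minimal sketch, with hypothetical helper names `tertile_thresholds` and `similarity_bucket` and made-up scores:

```python
def tertile_thresholds(scores):
    """Return the values at one third and two thirds of the sorted scores."""
    ordered = sorted(scores)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def similarity_bucket(score, low_cut, high_cut) -> int:
    """0 = most similar (greenest), 1 = middle, 2 = least similar (yellowest)."""
    # The higher cut must be tested first: any score above high_cut is also
    # above low_cut, so the opposite order would never return the middle band.
    if score > high_cut:
        return 0
    if score > low_cut:
        return 1
    return 2


# Made-up similarity scores on a 0-100 scale.
toy_scores = [91, 85, 78, 70, 66, 60]
low_cut, high_cut = tertile_thresholds(toy_scores)
buckets = [similarity_bucket(s, low_cut, high_cut) for s in toy_scores]
print(low_cut, high_cut, buckets)
```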

In [None]:
paint_similarity_word_cloud(COVID_WORDS, covid_embeddings_model, 'Similar Words to Covid Words')

#### Words Similarity

Here I use t-SNE to reduce the dimensionality of the embeddings and display them on a 2D map.

In [None]:
create_tsne_fig_of_words(COVID_WORDS, covid_embeddings_model, 'Covid Words Similarity',perplexity=8, r_state=36)

From the graph above we see that the model learned some logical patterns about the Covid words:
* Words related to sanitation are clustered together (bottom right).
* Words related to healthcare and regulations are clustered near one another.
* All the Covid words are clustered near one another together with the word virus in between (top left).
<br><br>
---
<br>Below is a graph displaying the similarity between some of the Covid words (I didn't use all of them, to keep the graph readable) and other words from reviews. The other words are separated into 2 groups: words that have a good meaning in a review and words that have a bad meaning in a review.

In [None]:
good_words = {'good', 'comfortable', 'nice', 'comfy', 'beautiful', 'amazing', 'awesome', 'great', 'excellent', 'amazing', 'stunning', 'lovely',
              'fantastic', 'perfect', 'incredible', 'fabulous', 'cool', 'wonderful', 'phenomenal', 'spectacular'}
bad_words = {'bad', 'uncomfortable', 'ugly', 'annoying', 'stink', 'stinks', 'unpleasant', 'annoying', 'suck', 'awful', 'disappointing', 'negative', 'mean',
             'frustrating', 'disappointed', 'dangerous', 'gross', 'inconvenient', 'messy', 'unsanitary', 'dirty', 'filthy'}
covid_words_tsne = {'covid', 'quarantine', 'pandemic', 'corona', 'disinfection', 'bacterial', 'germ', 'virus', 'sanitize', 'vaccine', 'pandemic', 'bacteria',
                    'isolation', 'disinfect', 'anti_bacterial'}

def coloring_func(word):
    if word in good_words:
        return 'green'
    if word in bad_words:
        return 'red'
    return 'yellow'

legend_dict = {'green': 'Good Words', 'red': 'Bad Words', 'yellow': 'Covid Words'}

to_use = covid_words_tsne.union(good_words).union(bad_words)
tsne_fig = create_tsne_fig_of_words(to_use, covid_embeddings_model, 'Covid and General Words Similarity', color_values=coloring_func, r_state=92)
tsne_fig.for_each_trace(lambda trace: trace.update(name=legend_dict[trace.name],
                                                   legendgroup = legend_dict[trace.name],
                                                   hovertemplate = trace.hovertemplate.replace(trace.name, legend_dict[trace.name])))


tsne_fig.show(renderer=RENDERER)

**In this graph we can see that the good words are clustered together at the top left corner, and most of the Covid words are clustered near one another but are interlaced with bad words. This suggests that, from the reviews, the model learned that Covid words mostly carry a bad meaning.**

### Topic Clustering on Reviews

This part explores whether Latent Dirichlet Allocation (LDA) can cluster the comments in meaningful ways.
<br>
In this part I used only 30,000 reviews out of ~500,000 due to memory limitations.
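
LDA in gensim consumes each document as a bag of words, that is, a list of `(token_id, count)` pairs. The sketch below is a dependency-free approximation of that representation, with hypothetical helper names; it is not gensim's code, and the ids assigned by `gensim.corpora.Dictionary` may differ.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of `doc` found in `vocab`."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Made-up tokenized reviews.
toy_docs = [["clean", "quiet", "clean"], ["quiet", "host"]]
toy_vocab = build_vocabulary(toy_docs)
toy_bows = [to_bag_of_words(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab, toy_bows)
```

In the cell that follows, each review appears to be rebuilt from the vocabulary columns with a non-zero count, so a word repeated within a review would enter the model with a count of 1 rather than its true frequency.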

In [None]:
def create_counts_df(corpus: List[str], ngram_range=(1, 1)):
    stop_words = set(nltk_stopwords.words("english")).union(nltk_stopwords.words("spanish")).union(nltk_stopwords.words("french"))
    cv = CountVectorizer(analyzer="word", stop_words=stop_words, ngram_range=ngram_range, lowercase=True)
    term_matrix = cv.fit_transform(corpus)
    counts_df = pd.DataFrame(term_matrix.toarray(), columns=cv.get_feature_names_out())
    return counts_df


def create_and_train_lda_model(counts_df: pd.DataFrame, num_of_topics):
    df = counts_df.apply(lambda word_count: word_count > 0)
    sentences_clean = df.apply(lambda row: list(counts_df.columns[row.values]), axis=1)
    dictionary = gensim.corpora.Dictionary(sentences_clean)
    topic_modeling = [dictionary.doc2bow(text) for text in sentences_clean]

    lda_model = gensim.models.ldamodel.LdaModel(corpus=topic_modeling, id2word=dictionary, num_topics=num_of_topics, random_state=100, update_every=0,
                                                chunksize=30, passes=10, alpha='symmetric', iterations=100, per_word_topics=True)

    return lda_model, topic_modeling


def create_lda_vis(data: List[List[str]], num_of_topics, ngram_range=(1, 1)):
    corpus = clean_and_flatten_corpus(data)
    counts_df = create_counts_df(corpus, ngram_range)
    lda_model, topic_modeling = create_and_train_lda_model(counts_df, num_of_topics)
    vis = gensimvis.prepare(lda_model, topic_modeling, dictionary=lda_model.id2word)
    return vis

#### Clustering Reviews by Ratings (1-5)

In [None]:
create_lda_vis([df_pre_reviews["comments"].tolist()[:10_000], df_covid_reviews["comments"].tolist()[:10_000]], num_of_topics=5, ngram_range=(1,1))

This method failed; clustering by rating doesn't work.

#### Clustering Reviews by Pre & Post Covid comments

In [None]:
create_lda_vis([df_pre_reviews["comments"].tolist()[:10_000], df_covid_reviews["comments"].tolist()[:10_000]], num_of_topics=2, ngram_range=(1,1))

**This clustering did split the comments into 2 distinct clusters, but none of the Covid words appear among the top words and the cluster sizes are significantly different, meaning it didn't split the comments well by period.**