### Topic Frequency Calculator
> - Import three datasets: topic list, reviews, and listings
> - Preprocess the words the same way as in the model section
> - Build the frequency counter, compute the frequency per topic per listing, and save the frequencies in a dataframe
> - Filter the listings by average frequency, keeping only those within a specified range so that the later analysis is smoother
> - Merge the frequencies with the listings dataset on the listing ID
> - Save the resulting dataframe for further analysis

### Step 1: Imports


In [4]:
import pandas as pd

In [3]:
#Import data
topics= pd.read_csv('C:/Users/chris/Desktop/topics_extended_stopwords.csv').drop(["Unnamed: 0"], axis=1)
reviews=pd.read_csv('C:/Users/chris/Desktop/reviews.csv')
listings=pd.read_csv('C:/Users/chris/Desktop/listings.csv')
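
An `Unnamed: 0` column typically appears when a CSV was written with its dataframe index included. As a small sketch on made-up data (the CSV text below is hypothetical, not the real topics file), the column can either be dropped after reading, as in the cell above, or read directly as the index:

```python
import io
import pandas as pd

# Hypothetical toy CSV whose first, unnamed column is a saved index
csv_text = ",Topic_1,Topic_2\n0,clean,host\n1,tidy,friendly\n"

# Option 1 (as in the cell above): read everything, then drop the index column
dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Option 2: tell read_csv that the first column is the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns))
print(list(indexed.columns))
```

Both options leave only the topic columns in the dataframe.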

In [4]:
#Create a sample to test the functions (sampling is now disabled: the entire df is used under the name reviews_sample so the rest of the code stays unchanged)
reviews_sample=reviews.sort_values(by=["listing_id"])#.iloc[100:10000]

### Step 2: Preprocessing

In [5]:
#preprocess words
import re
reviews_sample.drop(columns=['id', 'reviewer_id', 'reviewer_name'], inplace=True)

# Keep only letters and spaces (this also removes punctuation), then replace the
# 'brbr' residue (presumably left by stripped <br/><br/> tags) with a space
reviews_sample['comments_processed'] = reviews_sample['comments'].map(lambda x: re.sub('brbr', ' ', re.sub('[^a-zA-Z ]', '', str(x))))
# Lowercase everything
reviews_sample['comments_processed'] = reviews_sample['comments_processed'].map(lambda x: x.lower())

In [6]:
reviews_sample

| | listing_id | date | comments | comments_processed |
|---|---|---|---|---|
| 0 | 13913 | 2010-08-18 | My girlfriend and I hadn't known Alina before ... | my girlfriend and i hadnt known alina before w... |
| 20 | 13913 | 2020-02-22 | Outstanding host. Got along great. | outstanding host got along great |
| 19 | 13913 | 2019-11-25 | Alina’s place is great! It’s very stylish and ... | alinas place is great its very stylish and cos... |
| 18 | 13913 | 2019-10-07 | Felt at home - Alina is an excellent host - ve... | felt at home alina is an excellent host very... |
| 17 | 13913 | 2019-10-02 | Alina is a very relaxed and friendly host who ... | alina is a very relaxed and friendly host who ... |
| ... | ... | ... | ... | ... |
| 1042999 | 53622933 | 2021-12-05 | Gregory is an absolutely amazing host! He went... | gregory is an absolutely amazing host he went ... |
| 1043000 | 53629457 | 2021-12-04 | Those considering the aptm as a last minute bo... | those considering the aptm as a last minute bo... |
| 1043001 | 53656459 | 2021-12-06 | One of the worst places I have ever stayed... ... | one of the worst places i have ever stayed ver... |
| 1043002 | 53657036 | 2021-12-05 | An exceptional little apartment for a short st... | an exceptional little apartment for a short st... |
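
To make the preprocessing easier to check on single strings, the same steps can be wrapped in a small function. This is only a sketch: the name `preprocess` is hypothetical, and the idea that `brbr` comes from `<br/><br/>` tags is an assumption based on what remains after all non-letter characters are removed.

```python
import re

def preprocess(comment):
    """Keep only ASCII letters and spaces, replace 'brbr' with a space, lowercase."""
    letters_only = re.sub(r"[^a-zA-Z ]", "", str(comment))
    return re.sub("brbr", " ", letters_only).lower()

# A review taken from the table above
print(preprocess("Outstanding host. Got along great."))
# outstanding host got along great

# Assumed origin of 'brbr': an HTML line break pair inside the comment
print(preprocess("Great stay!<br/><br/>Would return"))
# great stay would return
```

Because of the `str(x)` conversion, a missing comment (NaN) ends up as the literal word `nan`, which may be worth filtering out before counting.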


### Step 3: Frequency counter

In [15]:
#----- set variables
n_topics = 20
topic_frequency = {str(topic): [] for topic in range(n_topics)}
# One set of words per topic, built once (the topic columns are named Topic_1 ... Topic_20)
topic_words = [set(topics["Topic_" + str(topic + 1)]) for topic in range(n_topics)]

ids = []

#----- Start frequency count
for row in range(len(reviews_sample)):
    ids.append(reviews_sample.iloc[row]["listing_id"])
    words = reviews_sample.iloc[row]["comments_processed"].split()
    for topic in range(n_topics):
        # Count the words of this review that belong to the topic
        c = sum(word in topic_words[topic] for word in words)
        topic_frequency[str(topic)].append(c)

# Incorporate id in the data
topic_frequency["id"] = ids

#Dataframe creation
topic_frequency = pd.DataFrame(data=topic_frequency)

#groupby and sum
topic_frequency = topic_frequency.groupby(by=["id"]).sum()

In [19]:
topic_frequency.to_csv('C:/Users/chris/Desktop/topics_frequency.csv')
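
The counting step can also be written with `apply` instead of an explicit row loop. The sketch below runs on hypothetical toy data with two topics and three reviews; the helper `count_topics` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy inputs with the same layout as the notebook's data
topics = pd.DataFrame({"Topic_1": ["clean", "tidy"], "Topic_2": ["host", "friendly"]})
reviews_sample = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments_processed": ["clean room friendly host", "very clean", "friendly host host"],
})

# Build one set per topic once, so each membership test is a fast set lookup
topic_words = {col: set(topics[col].dropna()) for col in topics.columns}

def count_topics(text):
    """Return, for one review, how many of its words fall in each topic."""
    words = text.split()
    return pd.Series({col: sum(w in vocab for w in words)
                      for col, vocab in topic_words.items()})

# One row of counts per review, then summed per listing
per_review = reviews_sample["comments_processed"].apply(count_topics)
per_review["id"] = reviews_sample["listing_id"].values
topic_frequency = per_review.groupby("id").sum()
print(topic_frequency)
```

Repeated words are counted every time they occur, as in the loop version, so listing 2 gets a count of 3 for `Topic_2`.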

### Step 4: Filtering and Merging

In [29]:
topic_frequency["Avg"]=topic_frequency.mean(axis=1)
topic_frequency.sort_values(by="Avg",inplace=True)
topic_frequency_join= topic_frequency[(topic_frequency["Avg"]>=10) & (topic_frequency["Avg"]<=500)]


# Rename the count columns "0" ... "19" to "Topic 0" ... "Topic 19"
topic_names = {str(t): "Topic " + str(t) for t in range(20)}

# Reduced dataset: filtered frequencies merged with the listings
df_out = pd.merge(topic_frequency_join, listings, on="id").rename(columns=topic_names)

# Complete dataset: unfiltered frequencies, saved without the listings columns
topic_frequency.rename(columns=topic_names).to_csv('C:/Users/chris/Desktop/Listings+Frequency_Complete.csv')

### Step 5: Saving the dataframe

In [32]:
df_out.to_csv('C:/Users/chris/Desktop/Listings+Frequency_Reduced.csv')
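
The filtering and merging from Step 4 can be checked on a small example before running them on the full data. The sketch below uses hypothetical values for three listings and two topics; the `[10, 500]` range is the one used above, everything else is made up for illustration.

```python
import pandas as pd

# Hypothetical per-listing counts, indexed by listing id as after the groupby
topic_frequency = pd.DataFrame(
    {"0": [1, 40, 900], "1": [3, 20, 700]},
    index=pd.Index([11, 22, 33], name="id"),
)
listings = pd.DataFrame({"id": [11, 22, 33], "name": ["A", "B", "C"]})

# Average only over the topic columns, so rerunning never averages "Avg" itself
topic_cols = ["0", "1"]
topic_frequency["Avg"] = topic_frequency[topic_cols].mean(axis=1)

# Keep listings whose average lies in [10, 500] (between is inclusive by default)
kept = topic_frequency[topic_frequency["Avg"].between(10, 500)]

# Turn the id index into a column and merge with the listings
df_out = kept.reset_index().merge(listings, on="id", how="inner")
print(df_out)
```

Here only listing 22 (average 30) survives the filter; listing 11 is too sparse and listing 33 is above the upper bound.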