This script reads the output of the VADER analysis, labels each record as positive, negative, or neutral based on the calculated compound score, and then creates a sampled dataset. The sampled records can be used to estimate the accuracy of the VADER model.

#### import modules

In [1]:
import os
import pandas as pd
import numpy as np

#### Read data and perform sampling

In [2]:
df = pd.read_csv("vaderCityOut.csv.gz", compression = "gzip")



In [3]:
# add a new column for the label predicted by VADER (based on the compound column)
conditions = [
    (df['compound'] < 0),
    (df['compound'] > 0),
    (df['compound'] == 0)
]

values = ['negative', 'positive', 'neutral']

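# explicit string default: rows with a missing compound score match no condition,
# and the implicit default of 0 may be rejected alongside string labels on recent NumPy versions
df['vader label'] = np.select(conditions, values, default='unknown')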

df.head()

Unnamed: 0,listing_id,neighborhood,price,property,room,rating,zipcode,date,comments,COVID,city,source,neg,neu,pos,compound,vader label
0,,,$$$,,,,28803,2021-02-28,Clean with Spectacular Service This hotel was ...,Post-,Asheville,TripAdvisor,0.009,0.732,0.259,0.9964,positive
1,,,$$$,,,,28803,2021-01-31,Awesome Hotel! My boyfriend and I had a weeken...,Post-,Asheville,TripAdvisor,0.0,0.559,0.441,0.9782,positive
2,,,$$$,,,,28803,2021-02-28,"Great hotel Decided on a trip to Asheville, it...",Post-,Asheville,TripAdvisor,0.0,0.606,0.394,0.9843,positive
3,,,$$$,,,,28803,2021-02-28,Great hotel We loved this hotel! Beautifully d...,Post-,Asheville,TripAdvisor,0.034,0.528,0.438,0.9769,positive
4,,,$$$,,,,28803,2021-02-28,"Essence of ""Southern Hospitality""! This is a w...",Post-,Asheville,TripAdvisor,0.0,0.728,0.272,0.9848,positive
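The labelling rule above can be tried on a few rows before running it on the full file. The following is a minimal sketch, assuming a toy DataFrame with only a `compound` column; the helper name `add_vader_label` and the `"unknown"` fallback label are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_vader_label(frame, score_col="compound", label_col="vader label"):
    """Return a copy of `frame` with a sign-based sentiment label column."""
    out = frame.copy()
    conditions = [
        out[score_col] < 0,
        out[score_col] > 0,
        out[score_col] == 0,
    ]
    values = ["negative", "positive", "neutral"]
    # Rows that match no condition (for example NaN scores) receive the
    # explicit fallback label; "unknown" is an assumption of this sketch.
    out[label_col] = np.select(conditions, values, default="unknown")
    return out


# Toy scores: two taken from the outputs shown in this notebook, plus 0 and NaN.
toy = pd.DataFrame({"compound": [0.9964, -0.5423, 0.0, np.nan]})
labelled = add_vader_label(toy)
print(labelled["vader label"].tolist())
```

Checking the fallback label on a small frame like this makes it easy to spot records whose compound score is missing.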


In [4]:
# create a separate DataFrame for each predicted label
df1 = df[df['vader label'] == 'negative']
df2 = df[df['vader label'] == 'positive']
df3 = df[df['vader label'] == 'neutral']

In [5]:
print('There are %d records with negative, %d records with positive and %d records with neutral comments per VADER analysis.' % (len(df1), len(df2), len(df3)))

There are 297350 records with negative, 9736272 records with positive and 166722 records with neutral comments per VADER analysis.
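The same counts can also be read directly from the label column without building three separate data frames. This is a small sketch on made-up labels, assuming the column is named `vader label` as above; `value_counts(normalize=True)` additionally gives the share of each label.

```python
import pandas as pd

# Toy labels standing in for the full VADER output.
toy = pd.DataFrame(
    {"vader label": ["positive", "positive", "negative", "neutral", "positive"]}
)

# Absolute number of records per label.
counts = toy["vader label"].value_counts()

# Fraction of records per label (sums to 1.0).
shares = toy["vader label"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```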


In [6]:
# option 1 (disabled): sample about 100 records in proportion to the negative/positive/neutral ratio
# neg_samples = int(np.ceil(len(df1)/len(df)*100))
# pos_samples = int(np.ceil(len(df2)/len(df)*100))
# neu_samples = int(np.ceil(len(df3)/len(df)*100))
# frames = [df1.sample(neg_samples), df2.sample(pos_samples), df3.sample(neu_samples)]

# option 2 (used): sample a fixed number of records per label, then join the sampled data frames
frames = [df1.sample(30), df2.sample(50), df3.sample(20)]
sampled_df = pd.concat(frames, axis=0)
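Because `sample` draws different rows on every run, the sampled file changes each time the notebook is executed. The following sketch shows one way to make the per-label sampling repeatable by passing `random_state`; the helper `sample_per_label`, the seed value, and the toy data are assumptions for illustration only.

```python
import pandas as pd


def sample_per_label(frame, sizes, label_col="vader label", seed=0):
    """Draw a fixed number of rows per label and stack the results."""
    parts = []
    for label, n in sizes.items():
        subset = frame[frame[label_col] == label]
        # A fixed random_state makes the draw repeatable across runs.
        parts.append(subset.sample(n=n, random_state=seed))
    return pd.concat(parts, axis=0)


# Toy data: 3 negative, 6 positive and 3 neutral reviews.
toy = pd.DataFrame(
    {
        "comments": [f"review {i}" for i in range(12)],
        "vader label": ["negative"] * 3 + ["positive"] * 6 + ["neutral"] * 3,
    }
)

sampled = sample_per_label(toy, {"negative": 2, "positive": 3, "neutral": 1})
print(sampled["vader label"].value_counts().to_dict())
```

With the full data set, the sizes would correspond to the 30/50/20 split used in the cell above.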

In [7]:
# prevent truncating the comments
pd.set_option("display.max_colwidth", 2000)
sampled_df.head(1)

Unnamed: 0,listing_id,neighborhood,price,property,room,rating,zipcode,date,comments,COVID,city,source,neg,neu,pos,compound,vader label
133161,,,$$$$,,,,2116,2019-08-31,Family situation. Loews pulled through. We had to cancel a last minute trip due to a sudden family death. \n\nWe didn't know if we would be penalized because of our cancellation. \n\nLoews was extremely accommodating and understanding. \n\nWe plan on eventually making a trip to Boston. And Loews will certainly be our place to stay.,Pre-,Boston,TripAdvisor,0.103,0.855,0.042,-0.5423,negative


#### write output file in .csv format

In [8]:
sampled_df.to_csv("vader sample.csv", index=False)
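Once the sampled comments have been reviewed by hand, the accuracy mentioned at the top of this notebook can be computed by comparing the two label columns. This is a minimal sketch on made-up rows; the `manual label` column is hypothetical and would have to be added to the sample file by a reviewer.

```python
import pandas as pd

# Toy stand-in for the reviewed sample; "manual label" is a hypothetical
# column that a human reviewer would fill in.
reviewed = pd.DataFrame(
    {
        "vader label": ["negative", "positive", "neutral", "positive"],
        "manual label": ["positive", "positive", "neutral", "positive"],
    }
)

# True where the VADER label agrees with the manual label.
matches = reviewed["vader label"] == reviewed["manual label"]

# Overall share of agreeing rows.
accuracy = matches.mean()

# Share of agreeing rows within each VADER label.
per_label = matches.groupby(reviewed["vader label"]).mean()

print(f"accuracy: {accuracy:.2f}")
print(per_label.to_dict())
```

Since the sample uses fixed sizes per label rather than the ratio found in the full data, the per-label figures are likely more informative than the overall figure.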