#Do Elite Reviewers Tend to Review the Unreviewed?

Our question is whether elite reviewers tend to review businesses that do not yet have many reviews. We can't get inside the heads of the elite reviewers, but we can ask: as businesses accumulate more reviews, are more of those reviews written by users who are or have been elite, or by non-elite users? In other words, do places with high review counts (that is, popular places) tend to be targeted by elite reviewers, or is it the opposite, where elite reviewers tend to review less popular businesses with fewer reviews? Because we do not know how elite reviewers think or decide which businesses to review, we need to draw our insight from the data. To approach this question, we dive into the Yelp dataset and look at the fields we need for our analysis. We take these fields from three different files: business.json, user.json, and review.json.

#Business Data
To start off, business data from the business JSON file is needed in order to match the review count and the reviews to each business. We load this data into a dataframe that contains only the columns we select. The `business_id` is needed to uniquely identify each business, and the `review_count` gives the number of reviews that each business has.

In [None]:
df_business = spark.read.json("/yelp/business.bz2").select("business_id", "review_count").cache()
df_business.show()

The next cell caps every review count above 100 at 100 so that the histogram does not have a long tail. The cap can be adjusted, but businesses with more than 100 reviews account for a very small percentage of the data, so treating anything from 100 to 3,000 reviews as the same value gives a better idea of the trend.

In [None]:
df_business.createOrReplaceTempView("business")
df_business_summary = spark.sql("""
SELECT IF(review_count>100, 100, review_count) as review_count, business_id
FROM business
""")

#Taking a Look at our Data

After manipulating our data, we wanted to see how it is distributed in the form of a histogram. Creating this visualization helped us understand how the data is laid out so that we could determine an appropriate minimum and maximum that capture most of the businesses in the dataset. The histogram was created using the field `review_count`, and density was automatically calculated to describe the percentage of businesses in relation to the whole. As you can see, the distribution is skewed to the right, with review counts larger than 100 aggregated to the value 100. We set the maximum count at 100 since businesses that have more than 100 reviews make up only 8% of the data. The reviews past the first 100 will also be filtered out later on, because each additional review after the first 100 provides less information about the overall trend while increasing the amount of data that needs to be processed.

In [None]:
df_business_summary.display()

review_count,business_id
36,f9NumwFMBDn751xgFiRbNA
4,Yzvjg0SayhoZgCljUJRF9Q
5,XNoUzKckATkOD1hP6vghZg
3,6OAZjbxqM5ol29BuHsil3w
26,51M2Kk903DFYI6gnB5I6SQ
38,cKyLV5oWZJ2NudWgqs8VZw
81,oiAlXZPIFm2nBCt0DHLu_Q
18,ScYkbYNkDgCneBrD9vqhCQ
5,pQeaRpvuhoEqudo3uymHIQ
16,EosRKXIGeSWFYWwpkbhNnA


# Profiling Number of Records Based on Minimum Review Count

The table below profiles each candidate minimum to decide how many reviews a business must have for the histogram to be useful. It lists the number of businesses that have at least the given number of reviews.

| Min | # of records |
| ------- | ------- |
| >=20 | 66117 |
| >=30 | 48922 |
| >=40 | 38765 |
| >=50 | 31942 |
| >=60 | 26922 |

Based on this table, we chose a minimum of 30 reviews, which gives us 48,922 records to work with. We decided on this number because a lower minimum would skew our results for the proportion of elite and non-elite reviewers. For instance, one elite review out of 20 reviews already accounts for 5% of the reviews. A minimum that is too high would filter out many businesses, which we do not want, since we want to account for as many businesses as possible. If we selected 40 as our minimum instead of 30, we would lose roughly 10,000 businesses that we could otherwise use in our data.

In [None]:
df_business_summary = df_business_summary.filter("review_count>=30")
df_business_summary.count()
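
The trade-off described above can be sketched in plain Python, outside of Spark. This is only an illustration: `profile_minimums`, `single_review_share`, and the sample counts are hypothetical names and values, not part of the notebook or the dataset.

```python
from typing import Dict, Iterable, List


def profile_minimums(review_counts: Iterable[int], minimums: List[int]) -> Dict[int, int]:
    """Count how many businesses have at least m reviews, for each candidate minimum m."""
    counts = list(review_counts)
    return {m: sum(1 for c in counts if c >= m) for m in minimums}


def single_review_share(minimum: int) -> float:
    """Largest share of a business's reviews that one review can represent at this minimum."""
    return 1.0 / minimum


# Hypothetical capped review counts for eight businesses (not taken from the dataset)
sample_counts = [3, 18, 26, 36, 38, 45, 81, 100]
profile = profile_minimums(sample_counts, [20, 30, 40, 50, 60])

for minimum, records in profile.items():
    print(f">={minimum}: {records} records, one review is at most {single_review_share(minimum):.1%}")
```

Raising the minimum shrinks the influence of any single review but also shrinks the number of businesses left to analyze, which is the compromise behind choosing 30.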

#Users Data
User data is needed to determine whether users are elite or non-elite and in which years they were classified as elite. To do so, we import the `user_id` field to identify the unique users who are writing reviews, and the `elite` field to see which years they have been elite. This lets us compare two views of an elite review: one where a review counts as elite only if it was written in a year when the user was elite, and one where every review from a user who has ever been elite counts as an elite review.

In [None]:
df_users = spark.read.format('parquet').table("users_table").select("user_id", "elite").cache()
df_users.show()

#Reviews Data
To show when elite users are submitting their reviews, we extract the `date` on which each review was written. This lets us check whether a review was written in a year when the user was elite. To match the reviews with each uniquely identified business, the `business_id` is imported, too. We match the users to each review by importing the `user_id`.

In [None]:
df_reviews = spark.read.format('parquet').table("reviews_without_text_table").select("business_id", "user_id", "date").cache()
df_reviews.show()
df_reviews.count()

In [None]:
#This cell creates temp views to prepare for the next cell's join function
df_users.createOrReplaceTempView("users")
df_reviews.createOrReplaceTempView("reviews")

### Identifying Elite Status 
The query below selects `elite_years`, `user_id`, `business_id`, and `date`. The data extracted will help us determine each user's elite status by year and the date on which they wrote the review. We use the `SPLIT` function to turn the elite years into an array so that we can filter by elite users later on. The query matches each review to a user via the common `user_id` so that we can later determine whether or not each individual review is considered an elite review.

In [None]:
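#This cell joins the reviews to the users through an inner join and splits the elite years into an array
#The pattern ' *, *' matches a comma with optional spaces around it and avoids backslashes,
#which Python and the Spark SQL parser may each treat as escape characters
df_user_reviews = spark.sql("""
SELECT SPLIT(u.elite, ' *, *') AS elite_years, u.user_id, r.business_id, r.date
FROM reviews AS r INNER JOIN users AS u
ON r.user_id = u.user_id
""")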
df_user_reviews.show()
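
To make the shape of `elite_years` concrete, here is a small plain-Python sketch of the same kind of split. It uses Python's `re.split` rather than Spark's `SPLIT`, and `split_elite_years` is a hypothetical helper, so treat it as an illustration of the intent rather than the notebook's exact behavior.

```python
import re
from typing import List


def split_elite_years(elite: str) -> List[str]:
    """Split a comma-separated string of years, tolerating spaces around the commas."""
    return re.split(r"\s*,\s*", elite)


print(split_elite_years("2015,2016, 2017"))  # ['2015', '2016', '2017']
print(split_elite_years(""))                 # ['']
```

Note that an empty `elite` string yields a one-element list holding an empty string rather than an empty list, which appears to be why the later `is_elite2` check compares `elite_years[0]` against `''`.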

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_users.unpersist()
df_reviews.unpersist()

#Identifying Elite Status by Matching Year
The dataframe below checks whether the year of the review matches any year in which the user was an elite member. The `is_elite1` column displays `TRUE` if the review year appears in the `elite_years` array, and `FALSE` if it does not. In other words, it checks whether the user was elite in the year the review was written.

In [None]:
df_user_reviews.createOrReplaceTempView("user_reviews")
df_user_elite_1 = spark.sql("""
SELECT elite_years, user_id, business_id, date, IF(ARRAY_CONTAINS(elite_years, CAST(YEAR(date) AS STRING)), True, False) as is_elite1
FROM user_reviews
""")
df_user_elite_1.show(100)

#Identifying Elite Status by Elite History
Now that we've determined Elite status by checking whether the year the review was posted matches a year in which the user was considered Elite, we will determine the Elite status of a Yelp user from their status history. This will help us see whether the results change depending on how Elite status is defined. Will users who were once considered Elite yield the same results?

If a Yelp user was elite from 2013 to 2015, we treat that user as an Elite user regardless of whether they kept their status. To do this, an "If-Then" statement checks whether the elite years contain any data. Currently the `elite_years` column contains arrays of the years in which a user was considered Elite. If the `elite_years` column contains data, then the `is_elite2` column is populated as `true` (if there is no data, it is `false`).

In [None]:
df_user_elite_1.createOrReplaceTempView("user_elite1")
df_user_elite_2 = spark.sql("""
SELECT elite_years, user_id, business_id, date, is_elite1, IF(elite_years[0] <> '', true, false) AS is_elite2
FROM user_elite1
""")
df_user_elite_2.show(100)

#Identifying Elite Status Based On First Year Elite
Finally, let's look at another way of checking for elite status. In the next cell, we determine Elite status starting from the year in which the user first received the status. The user is considered Elite from then on, regardless of whether they lose the status afterwards.

Essentially, the user is considered elite STARTING from the first year that they were elite. Here we say that if the `MIN` year of being an Elite user (that is, the first year in which the user was elite) is `<=` the year of the review, then the review counts as an elite review.

In [None]:
df_user_elite_2.createOrReplaceTempView("user_elite_2")
df_user_elite_3 = spark.sql("""
SELECT elite_years, user_id, business_id, date, is_elite1, is_elite2, IF(ARRAY_MIN(elite_years) <= YEAR(date), true, false) AS is_elite3
FROM user_elite_2
GROUP BY business_id, elite_years, user_id, date, is_elite1, is_elite2
""")
df_user_elite_3.show(100)
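
The three definitions are easy to confuse, so here is a minimal plain-Python sketch that applies all of them to one hypothetical user. The function names mirror the notebook's column names, but the functions themselves are illustrative and are not the SQL used above.

```python
from typing import List


def is_elite1(elite_years: List[str], review_year: int) -> bool:
    """Method 1: the review year is one of the user's elite years."""
    return str(review_year) in elite_years


def is_elite2(elite_years: List[str]) -> bool:
    """Method 2: the user has been elite in at least one year."""
    return bool(elite_years) and elite_years[0] != ""


def is_elite3(elite_years: List[str], review_year: int) -> bool:
    """Method 3: the review was written in or after the user's first elite year."""
    years = [int(y) for y in elite_years if y != ""]
    return bool(years) and min(years) <= review_year


# Hypothetical user who was elite from 2013 to 2015
elite_years = ["2013", "2014", "2015"]
for year in (2012, 2014, 2017):
    print(year, is_elite1(elite_years, year), is_elite2(elite_years), is_elite3(elite_years, year))
# 2012 False True False
# 2014 True True True
# 2017 False True True
```

A review written before the first elite year counts as elite only under the second method, and a review written after the status lapsed counts under the second and third methods but not the first.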

In [None]:
#This cell creates temp views to prepare for the next cell's join function
df_business_summary.createOrReplaceTempView("business_summary")
df_user_elite_3.createOrReplaceTempView("user_elite_3")

#Identifying the Elite Statuses for the Business in Each Review
Now that we have three methods of determining Elite status in the columns `is_elite1`, `is_elite2`, and `is_elite3`, we need to match the businesses to the reviews. We want to identify the Elite statuses (using the three different methods we've created) for the business in each of the reviews. This next join matches the businesses with the reviews and reviewers (which were joined in the previous dataframes) and includes the `review_count` column, which we need in order to know the total number of reviews that each business has.

In [None]:
#This cell joins the user_elite_3 dataframe with the business_summary dataframe through the "business_id" column

df_business_reviews = spark.sql("""
SELECT ue.date, ue.elite_years, bs.review_count, ue.user_id, bs.business_id, ue.is_elite1, ue.is_elite2, ue.is_elite3
FROM business_summary AS bs INNER JOIN user_elite_3 AS ue
ON bs.business_id = ue.business_id
ORDER BY business_id
""")
df_business_reviews.show(500,truncate=True)
df_business_reviews.count()

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_business_summary.unpersist()
df_user_elite_1.unpersist()
df_user_elite_2.unpersist()
df_user_elite_3.unpersist()

#Preparing the Dataframe for Visualization
Now that we have our three methods of determining Elite status, it's time to put the dataframe together. We have included the columns `user_id`, `is_elite1`, `is_elite2`, `is_elite3`, `business_id`, and `review_count` in one single dataframe. To view our data in chronological order, we have also created a `review_order` column, which numbers the reviews of each `business_id` chronologically based on the date the review was posted. The `review_order` column will also determine the bin number that each review is assigned in the next few steps.

In [None]:
df_business_reviews.createOrReplaceTempView("business_reviews")
df_ordered_reviews = spark.sql("""
SELECT review_count, user_id, is_elite1, is_elite2, is_elite3, business_id, date, elite_years, ROW_NUMBER() OVER (PARTITION BY business_id ORDER BY date) AS review_order
FROM business_reviews
""")
df_ordered_reviews.show(truncate=True)
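
As a rough illustration of what `ROW_NUMBER() OVER (PARTITION BY business_id ORDER BY date)` produces, the following plain-Python sketch numbers reviews per business. `number_reviews` and the sample rows are hypothetical and assume ISO-formatted date strings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def number_reviews(reviews: List[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Return (business_id, date, review_order), numbering each business's reviews by date."""
    by_business: Dict[str, List[str]] = defaultdict(list)
    for business_id, date in reviews:
        by_business[business_id].append(date)
    ordered = []
    for business_id, dates in by_business.items():
        # ISO-formatted dates (YYYY-MM-DD) sort chronologically as plain strings
        for order, date in enumerate(sorted(dates), start=1):
            ordered.append((business_id, date, order))
    return ordered


# Hypothetical (business_id, date) pairs
sample = [("b1", "2016-05-01"), ("b2", "2015-01-10"), ("b1", "2014-03-12"), ("b1", "2015-07-30")]
for row in number_reviews(sample):
    print(row)
```

One caveat worth keeping in mind: when two reviews of the same business share the same `date` value, a row-number window does not guarantee which of them comes first.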

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_business_reviews.unpersist()

#Excess Reviews

Here, we filter out excess reviews (from `ordered_reviews`) per business so that the number of review rows kept for each business matches its `review_count`. This also means that we drop reviews past the 100th review, but as we determined earlier via the histogram at the beginning of the notebook, businesses with more than 100 reviews make up only about 8% of the data, so filtering out reviews past the 100th should not change any apparent trends in the data for elite reviews.

In [None]:
df_ordered_reviews.createOrReplaceTempView("ordered_reviews")
df_filtered_reviews = spark.sql("""
SELECT  review_count, user_id, is_elite1, is_elite2, is_elite3, business_id, date, elite_years, review_order
FROM ordered_reviews
WHERE review_order<=review_count
""")
df_filtered_reviews.show()
df_filtered_reviews.count()
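
The effect of the `review_order<=review_count` filter on a single business can be sketched in plain Python. The helper `keep_counted_reviews` and the 103-review business are hypothetical.

```python
from typing import List, Tuple


def keep_counted_reviews(rows: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep (review_count, review_order) rows whose order does not exceed the business's count."""
    return [(count, order) for count, order in rows if order <= count]


# A hypothetical business whose review_count was capped at 100 but which has 103 review rows
rows = [(100, order) for order in range(1, 104)]
kept = keep_counted_reviews(rows)
print(len(kept), kept[-1])  # 100 (100, 100)
```

Because `review_count` was capped at 100 earlier, the same comparison also trims every business to its first 100 reviews.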

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_ordered_reviews.unpersist()

#Determining the Range for Review Counts

In this next cell, we split the reviews into equal-sized bins of ten based on `review_order`. We want to visualize the distribution of elite reviews without plotting a separate data point for every review that a business has.

In [None]:
df_filtered_reviews.createOrReplaceTempView("filtered_reviews")
df_ordered_bins = spark.sql("""
SELECT review_count, user_id, business_id, date, is_elite1, is_elite2, is_elite3, elite_years, review_order, 
CASE
  WHEN review_order <= 10 THEN 1
  WHEN review_order <= 20 THEN 2
  WHEN review_order <= 30 THEN 3
  WHEN review_order <= 40 THEN 4
  WHEN review_order <= 50 THEN 5
  WHEN review_order <= 60 THEN 6
  WHEN review_order <= 70 THEN 7
  WHEN review_order <= 80 THEN 8
  WHEN review_order <= 90 THEN 9
  WHEN review_order >= 91 THEN 10
  ELSE -1
END AS bin_number
FROM filtered_reviews
""")
df_ordered_bins.show(truncate=True)
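
For positive values of `review_order`, the `CASE` expression above is equivalent to a short integer formula. The sketch below is a plain-Python restatement; `bin_number` is written here as a hypothetical function, not something defined in the notebook.

```python
def bin_number(review_order: int) -> int:
    """Map a 1-based review position to a bin of ten, with everything from 91 up in bin 10."""
    if review_order < 1:
        raise ValueError("review_order is expected to start at 1")
    return min((review_order - 1) // 10 + 1, 10)


print([bin_number(n) for n in (1, 10, 11, 30, 31, 90, 91, 100)])
# [1, 1, 2, 3, 4, 9, 10, 10]
```

Since `ROW_NUMBER()` starts at 1 and is never NULL, the `ELSE -1` branch of the SQL should not be reached in practice; it acts as a guard value.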

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_filtered_reviews.unpersist()

#Saving Our Findings

In this cell, we register our final dataframe as a temp view and count its rows before saving it as a table.

In [None]:
df_ordered_bins.createOrReplaceTempView("ordered_bins")
df_ordered_bins.count()

The cell below saves the table in Databricks to be able to send the data to Tableau for visualization.

In [None]:
df_ordered_bins.write.mode("overwrite").saveAsTable("DF_Final")

##Functions for Visualization

The next 2 cells use the Pillow Python library to grab images from the Databricks file system (DBFS) and display them as HTML. This allows us to show the visualizations that were created in Tableau.

In [None]:
from PIL import Image
TEMP_DIR = "/temp"

def getWidth(path):
  with Image.open(path) as img:
    width, height = img.size
    return(width)
  
def getDbfsPathName(path):
  # Get the fileinfo containing the path and name
  if not path.startswith("/dbfs"):
    raise Exception("The path provided does not start with /dbfs")
  new_path = "dbfs:" + path[5:]
  # get the file info for the path
  file_list = dbutils.fs.ls(new_path)
  if len(file_list) != 1:
    raise Exception("The path provided is not a single file on dbfs")
  dbfs_path = file_list[0].path
  filename = file_list[0].name
  return(dbfs_path, filename)
  
def getTempPath(filename):
  # Create the temp directory if it does not exist
  temp_path = "file:" + TEMP_DIR
  dbutils.fs.mkdirs(temp_path)
  temp_list = dbutils.fs.ls(temp_path)
  # get a name to use for the copy
  temp_files = []
  for info in temp_list:
    temp_files.append(info.name)
  increment = 0
  new_name = filename
  while new_name in temp_files:
    increment+=1
    new_name = filename + "." + str(increment)
  access_path = TEMP_DIR + "/" + new_name # used for file opening
  return(access_path)

In [None]:
import base64
from PIL import Image

def showimage(path, width=0):
  image_string = ""
  img_tag = ""
  dbfs_path, filename = getDbfsPathName(path)
  access_path = getTempPath(filename)
  # copy the file
  copy_path = "file:" + access_path
  dbutils.fs.cp(dbfs_path,copy_path)
  with open(access_path, "rb") as image_file:
    image_string = base64.b64encode(image_file.read() ).decode('utf-8') 
    
  # Is the width setting a positive integer?  A width of 50 means 50%
  if width > 0 and width < 1:
    print("If the width parameter is specified, it must be 1 or more.  A width of 50 means 50%. The width entered was " + str(width) + ", so the original image width was used.")
    width = 0 #reset
    
  if width == 0:
    height = 0
    # Get the width and height of the image in pixels
    with Image.open(access_path) as img:
      width, height = img.size
      
    framewidth = width * 1.1
    # Build the image tag
    img_tag = '''
    <style>
    div {
      min-width: %ipx;
      max-width: %ipx;
    }
    </style>
    <div><img src="data:image/png;base64, %s"  style="width:%ipx;height:%ipx;" /></div>''' % (framewidth,framewidth,image_string, width, height)
  else: # a width was specified
    originalWidth = getWidth(access_path)
    imagewidth = int( width / 100.0 * originalWidth)
    framewidth = int( imagewidth * 1.1 )
    # Build the image tag
    img_tag = '''
    <style>
    div {
      min-width: %ipx;
      max-width: %ipx;
    }
    </style>
    <div><img src="data:image/png;base64, %s"  width="%ipx" height="auto"></div>''' % (framewidth,framewidth,image_string, imagewidth)
  # Clean up the file
  dbutils.fs.rm(copy_path)
  return(img_tag)
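
The core idea of `showimage`, embedding the image bytes directly in the HTML as a base64 data URI, can be shown without `dbutils` or Pillow. The helper `build_img_tag` below is a hypothetical, stripped-down sketch and not a replacement for the function above.

```python
import base64


def build_img_tag(image_bytes: bytes, width_px: int, mime: str = "image/png") -> str:
    """Embed raw image bytes in an <img> tag using a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return '<img src="data:%s;base64,%s" width="%d" />' % (mime, encoded, width_px)


# The 8-byte PNG signature stands in for a real file's contents
tag = build_img_tag(b"\x89PNG\r\n\x1a\n", 200)
print(tag)
```

Because the image travels inside the HTML string itself, the rendered output does not depend on the browser being able to reach the original file path.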

##Visualizations for each elite categorization method

All of the visualizations were done in Tableau by putting the `review_order` measure in the row shelf and selecting histogram from the list of available visualizations. For each separate visualization, `is_elite1`, `is_elite2`, or `is_elite3` was dragged into the color section to display the difference between the number of elite and non-elite reviews.

In this next cell, we created a histogram of Elite and Non-Elite reviews. We used our first method of elite review categorization, which checks whether or not the year of the review matches a year in which the user was elite. If it matched, the review was considered elite; if it didn't match, it was considered a non-elite review.

In [None]:
displayHTML( showimage("/dbfs/FileStore/tables/Elite_1.png", 100) )

This next histogram is created using the second method of elite review categorization, where if a user was ever elite then the review that they created would be considered an elite review.

In [None]:
displayHTML( showimage("/dbfs/FileStore/tables/Elite_2.png", 100) )

This last histogram was created using the third method of elite review categorization, where starting from the first year that a user was elite, all of their reviews from then on would be considered elite reviews.

In [None]:
displayHTML( showimage("/dbfs/FileStore/tables/Elite_3.png", 100) )

In all of these histograms we see the same trend of elite to non-elite reviews. The total number of reviews does not start dropping until after 30, because we selected 30 as the minimum number of reviews that a business must have to be included in our data. After that point, there are fewer and fewer businesses with higher total review counts. Based on the bars in the histogram, for all of the elite categorizations the ratio of elite to non-elite reviews seems to trend downward as businesses get more reviews. In the next section we examine exactly what the trend in the ratio is and what information we can gain from a more precise look at the numbers.

#Getting the Elite:Non-Elite Ratio

The next set of cells counts the number of elite and non-elite reviews in each bin under our three different methods of determining whether or not a user is considered elite. After we count the elite and non-elite reviews, we bring the data into a final table where we divide the elite reviews by the non-elite reviews to get the ratio of elite to non-elite reviews in a single column. This way, we can see how that ratio relates to each bin of reviews.

In [None]:
#Counting total Elite reviews for the first method of determining Elite reviewers
df_true_elite1_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite1) AS is_elite1_true
FROM ordered_bins
WHERE is_elite1 = 'true'
GROUP BY bin_number, is_elite1
ORDER BY bin_number
""")
df_true_elite1_total.show()

In [None]:
#Counting total Non-Elite reviews for the first method of determining Elite reviewers
df_false_elite1_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite1) AS is_elite1_false
FROM ordered_bins
WHERE is_elite1 = 'false'
GROUP BY bin_number, is_elite1
ORDER BY bin_number
""")
df_false_elite1_total.show()

In [None]:
df_true_elite1_total.createOrReplaceTempView("elite1_true")
df_false_elite1_total.createOrReplaceTempView("elite1_false")
#Joining the Elite/Non-Elite reviews into a single table and getting the ratio for each bin
df_elite1_ratio = spark.sql("""
SELECT e1t.bin_number, e1t.is_elite1_true, e1f.is_elite1_false, e1t.is_elite1_true/e1f.is_elite1_false AS elite1_ratio
FROM elite1_true AS e1t INNER JOIN elite1_false AS e1f
ON e1t.bin_number = e1f.bin_number
ORDER BY bin_number
""")
df_elite1_ratio.show()
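
The count-then-divide pattern used for each method can be summarized in a few lines of plain Python. `elite_ratio_by_bin` and the toy rows are hypothetical; the sketch only mirrors the logic of the three SQL cells above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def elite_ratio_by_bin(rows: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Return elite / non-elite review counts per bin from (bin_number, is_elite) pairs."""
    counts = Counter(rows)
    bins = sorted({b for b, _ in counts})
    ratios = {}
    for b in bins:
        elite, non_elite = counts[(b, True)], counts[(b, False)]
        # Keep only bins that have both kinds of review, as the inner join in the SQL would
        if elite and non_elite:
            ratios[b] = elite / non_elite
    return ratios


# Toy data: bin 1 has 3 elite and 7 non-elite reviews, bin 2 has 2 and 8
toy_rows = [(1, True)] * 3 + [(1, False)] * 7 + [(2, True)] * 2 + [(2, False)] * 8
print(elite_ratio_by_bin(toy_rows))
```

Keep in mind that this is a ratio of elite to non-elite reviews, not a share of all reviews: a ratio of 0.25 corresponds to elite reviews making up 20% of the bin.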

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_true_elite1_total.unpersist()
df_false_elite1_total.unpersist()

In [None]:
#Counting total Elite reviews for the second method of determining Elite reviewers
df_true_elite2_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite2) AS is_elite2_true
FROM ordered_bins
WHERE is_elite2 = 'true'
GROUP BY bin_number, is_elite2
ORDER BY bin_number
""")
df_true_elite2_total.show()

In [None]:
#Counting total Non-Elite reviews for the second method of determining Elite reviewers
df_false_elite2_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite2) AS is_elite2_false
FROM ordered_bins
WHERE is_elite2 = 'false'
GROUP BY bin_number, is_elite2
ORDER BY bin_number
""")
df_false_elite2_total.show()

In [None]:
df_true_elite2_total.createOrReplaceTempView("elite2_true")
df_false_elite2_total.createOrReplaceTempView("elite2_false")
#Joining the Elite/Non-Elite reviews into a single table and getting the ratio for each bin
df_elite2_ratio = spark.sql("""
SELECT e2t.bin_number, e2t.is_elite2_true, e2f.is_elite2_false, e2t.is_elite2_true/e2f.is_elite2_false AS elite2_ratio
FROM elite2_true AS e2t INNER JOIN elite2_false AS e2f
ON e2t.bin_number = e2f.bin_number
ORDER BY bin_number
""")
df_elite2_ratio.show()

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_true_elite2_total.unpersist()
df_false_elite2_total.unpersist()

In [None]:
#Counting total Elite reviews for the third method of determining Elite reviewers
df_true_elite3_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite3) AS is_elite3_true
FROM ordered_bins
WHERE is_elite3 = 'true'
GROUP BY bin_number, is_elite3
ORDER BY bin_number
""")
df_true_elite3_total.show()

In [None]:
#Counting total Non-Elite reviews for the third method of determining Elite reviewers
df_false_elite3_total = spark.sql(""" 
SELECT bin_number, COUNT(is_elite3) AS is_elite3_false
FROM ordered_bins
WHERE is_elite3 = 'false'
GROUP BY bin_number, is_elite3
ORDER BY bin_number
""")
df_false_elite3_total.show()

In [None]:
df_true_elite3_total.createOrReplaceTempView("elite3_true")
df_false_elite3_total.createOrReplaceTempView("elite3_false")
#Joining the Elite/Non-Elite reviews into a single table and getting the ratio for each bin
df_elite3_ratio = spark.sql("""
SELECT e3t.bin_number, e3t.is_elite3_true, e3f.is_elite3_false, e3t.is_elite3_true/e3f.is_elite3_false AS elite3_ratio
FROM elite3_true AS e3t INNER JOIN elite3_false AS e3f
ON e3t.bin_number = e3f.bin_number
ORDER BY bin_number
""")
df_elite3_ratio.show()

In [None]:
#Unpersist dataframes that are no longer used to free up memory
df_true_elite3_total.unpersist()
df_false_elite3_total.unpersist()

#Ratio Visualization

To visualize the trend in the ratio of elite to non-elite reviews, we plot the ratio for each bin of 10 reviews on a line graph, once for each method of elite categorization.

The cell below shows the trend for the categorization of elite reviews for when the year of the review matches a year that the user was elite.

In [None]:
df_elite1_ratio.display()

bin_number,is_elite1_true,is_elite1_false,elite1_ratio
1,105852,383368,0.2761106821643955
2,89978,399242,0.2253720800917739
3,83312,405908,0.2052484799511219
4,68634,360923,0.1901624446211519
5,55662,292392,0.190367725519166
6,45780,244588,0.1871718972312623
7,39203,209104,0.1874808707628739
8,33908,182347,0.1859531552479613
9,29647,160272,0.1849792852151342
10,26388,142199,0.1855709252526389


In the next cell, the line graph shows the trend when reviews are considered elite if the user was ever an elite reviewer.

In [None]:
df_elite2_ratio.display()

bin_number,is_elite2_true,is_elite2_false,elite2_ratio
1,145829,343391,0.4246733315666398
2,121605,367615,0.330794445275628
3,111747,377473,0.2960397167479528
4,94253,335304,0.2810971536277527
5,76421,271633,0.2813391598222602
6,63117,227251,0.2777413520732582
7,53963,194344,0.2776674350635986
8,46862,169393,0.2766466146771118
9,41165,148754,0.276732054264087
10,36429,132158,0.2756473312247461


In this last cell the line graph shows the trend when all reviews are considered elite starting from the first year that a user was elite.

In [None]:
df_elite3_ratio.display()

bin_number,is_elite3_true,is_elite3_false,elite3_ratio
1,111817,377403,0.2962801037617613
2,96033,393187,0.2442425614275141
3,91161,398059,0.2290137894131272
4,77899,351658,0.2215192033168589
5,63553,284501,0.2233841005831262
6,52390,237978,0.220146400087403
7,44979,203328,0.2212139990557129
8,39013,177242,0.2201114859909051
9,34295,155624,0.2203708939495193
10,30555,138032,0.2213617132259186


In all 3 of these line graphs, we see a sharp decline in the ratio of elite to non-elite reviews until bin 4 (which covers reviews 31 through 40). From bin 4 onward, the ratio of elite to non-elite reviews tends to level out.
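
To put numbers on "sharp decline" and "level out", the relative change from one bin to the next can be computed from the ratios in the three tables above. The values below are copied from those tables and rounded to four decimals; `relative_changes` is a hypothetical helper.

```python
from typing import Dict, List

# Ratios for bins 1 to 10, copied from the tables above and rounded to four decimals
ratios: Dict[str, List[float]] = {
    "elite1": [0.2761, 0.2254, 0.2052, 0.1902, 0.1904, 0.1872, 0.1875, 0.1860, 0.1850, 0.1856],
    "elite2": [0.4247, 0.3308, 0.2960, 0.2811, 0.2813, 0.2777, 0.2777, 0.2766, 0.2767, 0.2756],
    "elite3": [0.2963, 0.2442, 0.2290, 0.2215, 0.2234, 0.2201, 0.2212, 0.2201, 0.2204, 0.2214],
}


def relative_changes(values: List[float]) -> List[float]:
    """Relative change from each bin to the next, e.g. -0.18 for an 18% drop."""
    return [(b - a) / a for a, b in zip(values, values[1:])]


for name, values in ratios.items():
    changes = relative_changes(values)
    early = changes[:3]  # bin 1 to 2, 2 to 3, and 3 to 4
    late = changes[3:]   # bin 4 to 5 through bin 9 to 10
    print(name, [f"{c:+.1%}" for c in early], f"largest later change: {max(abs(c) for c in late):.1%}")
```

With these figures, each of the first three steps is a drop of more than 3% for every method, while every step after bin 4 moves the ratio by less than 2% in either direction.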

#Conclusion

Based on our data and research, it seems that the answer to our initial question of whether or not Elite users tend to review the unreviewed is a resounding "yes". Even when we consider multiple different methods of defining and categorizing an elite review, the trend in the data holds: elite reviewers tend to be among the first people to review businesses. Before a business reaches 30 reviews, it can be considered unreviewed, and the ratio of elite reviews to non-elite reviews is highest in this time frame. Once a business hits the 40-review mark, the ratio of new elite reviews to non-elite reviews remains steady. This could suggest that non-elite reviewers take cues to visit unreviewed businesses after seeing that, among the few reviews those businesses do have, some are from elite reviewers.

#Limitations

The minimum number of reviews was a compromise that we needed to make between having a useful ratio of elite to non-elite reviews and keeping a larger number of businesses in the data.