## Motivation
### What is your dataset?
The dataset used in this project comes from the data provided for the Yelp Dataset Challenge. It consists of about 1.5 million users and about 200 thousand businesses from North America. Additionally, the dataset includes just under 6 million reviews of businesses, written by users of the Yelp service. The businesses in the dataset include both restaurants and businesses offering other services, such as postal delivery.

### Why did you choose this/these particular dataset(s)?
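The dataset is very large and combines user, business, and review data, which makes it possible to study user behaviour at scale.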
### What was your goal for the end user's experience?
The purpose of this project is to investigate properties of Yelp’s Elite users, who according to Yelp make up the “true heart of the Yelp community.” The focus lies on Yelp’s two primary claims about its Elite users:

First, Yelp states that its Elite users have high connectivity, which means that they are connected with many other users and interact often with members of their Yelp community.

Second, Yelp claims that its Elite users have high contribution, which means that they have made a large impact on the site with meaningful and high-quality reviews.

The first goal of our project is to analyze whether the above claims about Yelp’s Elite users are quantifiably valid. To do so, we specify several characteristics which we expect Elite users to have based on these claims, and then analyze Yelp’s dataset to determine whether these properties are truly represented among the Elite users. The secondary goal of our project is to find which properties are most indicative of Elite status on Yelp; the analyses for the first goal can be used for this purpose as well.

This kind of information may be useful for those who are interested in becoming Elite members on Yelp. In order to become a member of the “Elite squad,” a user must go through an application process. Despite the claims presented above, Yelp doesn’t provide any specific criteria for exactly which characteristics a user must have to become Elite. The mystery behind the selection process for Elite users is well-documented.




## Basic stats. Let's understand the dataset better
### Write about your choices in data cleaning and preprocessing
- Converted the mis-formatted JSON files, which store one JSON object per line without separating commas, into valid JSON

In [26]:
import os
from itertools import islice

import pandas as pd


def cleanup(N, dataset, chunk_size=100000):
    '''
    Converts a Yelp dataset file, which stores one JSON object per line
    without separating commas, into at most N valid JSON files of the
    form {"data": [obj, obj, ...]}.
    The lines are processed in chunks of chunk_size, since the data
    files are too large to be held in memory all at once.
    '''
    dirty_path = 'yelp_dataset/yelp_academic_dataset_%s.json' % dataset
    # Open the raw file once and read it chunk by chunk in a single pass
    with open(dirty_path, "r", encoding="utf-8") as dirty_file:
        for k in range(N):
            # Take the next chunk_size lines and drop blank ones
            lines = [line.strip() for line in islice(dirty_file, chunk_size)]
            lines = [line for line in lines if line]
            if not lines:
                print("No more content.")
                break
            clean_path = "cleaned/%s%i.json" % (dataset, k)
            with open(clean_path, "w", encoding="utf-8") as clean_file:
                # Join the objects with commas and wrap them in a JSON array
                clean_file.write('{"data" : \n[%s\n]}' % ',\n'.join(lines))

    print('Iteration', k, 'done')


def read_json_to_df(N, dataset):
    # Create one dataframe from the cleaned JSON chunk files of a dataset
    frames = []
    for i in range(N):
        path = "cleaned/%s%i.json" % (dataset, i)
        if not os.path.exists(path):
            break  # Fewer than N chunks were written
        frames.append(pd.DataFrame(list(pd.read_json(path).data)))
    # Renumber the rows, since every chunk starts its index at 0
    return pd.concat(frames, ignore_index=True)

### Get all restaurants from Toronto

In [27]:
# Clean business JSON files
N = 2 # There are about 200k businesses, therefore 2 chunks of 100k elements are sufficient
dataset = 'business'
cleanup(N, dataset)

# Make dataframe from JSON data
df = read_json_to_df(N, dataset)

# Restaurants will contain the keywords 'restaurant' 
# and/or 'food' in the 'category' attribute.
keywords = ['restaurant', 'food']
idx = df.categories.str.lower().str.contains("|".join(keywords), na=False)
# Copy the selection, so that later column assignments do not act on a slice of df
rest = df[idx].copy()

# Only include Toronto restaurants
rest['city'] = rest.city.str.lower()
rest = rest[rest.city == 'toronto']

# Drop attributes irrelevant to the analysis
rest = rest.drop(['city', 'attributes', 'categories', 'address', 'neighborhood', 'is_open', 'hours'], axis=1)

# Save dataset to CSV
rest.to_csv('toronto2/toronto_restaurants.csv', header=False)

Iteration 1 done

### Get all reviews from Toronto

In [29]:
# Clean review JSON files
N = 60 # There are just under 6M reviews, therefore 60 chunks of 100k elements are sufficient
dataset = 'review'
cleanup(N, dataset)

# Make dataframe from JSON data
df = read_json_to_df(N, dataset)

# Filter out reviews of businesses outside Toronto
reviews = df[df.business_id.isin(rest.business_id)]

# Drop attributes irrelevant to the analysis
reviews = reviews.drop(['cool', 'funny', 'useful'], axis=1)

# Save dataset to CSV
reviews.to_csv('toronto2/toronto_reviews.csv')

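Since the review file holds close to 6 million records, building one dataframe from all chunks before filtering is expensive. As an alternative, the following is a minimal sketch that filters the raw file in a single streaming pass. It assumes that every line of the raw Yelp file is a standalone JSON object, as the cleanup step implies; the function name `filter_json_lines` and its parameters are our own and purely illustrative.

```python
import json


def filter_json_lines(path, key, allowed, drop=()):
    """Yield records from a line-delimited JSON file whose `key` value is in `allowed`.

    Hypothetical helper: only one line is held in memory at a time.
    """
    allowed = set(allowed)
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            record = json.loads(line)
            if record.get(key) in allowed:
                # Remove fields that are irrelevant to the analysis
                for field in drop:
                    record.pop(field, None)
                yield record
```

With this helper, the Toronto reviews could be collected with something like `pd.DataFrame(filter_json_lines('yelp_dataset/yelp_academic_dataset_review.json', 'business_id', rest.business_id, drop=('cool', 'funny', 'useful')))`, so that only the matching reviews ever reach the dataframe.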

### Get all users in the Toronto reviews

In [None]:
# Clean user JSON files
N = 30 # There are about 1.5M users, so 30 chunks of 100k elements is a generous upper bound
dataset = 'user'
cleanup(N, dataset)

# Make dataframe from JSON data
df = read_json_to_df(N, dataset)

# Filter out users not in the Toronto reviews
toronto_users = df[df.user_id.isin(reviews.user_id)]

# Drop attributes irrelevant to the analysis
toronto_users = toronto_users.drop([
    'compliment_cool', 'compliment_cute', 'compliment_funny',
    'compliment_hot', 'compliment_list', 'compliment_more',
    'compliment_note', 'compliment_photos', 'compliment_plain',
    'compliment_profile', 'compliment_writer', 'cool', 'funny', 'fans',
], axis=1)

# Save to CSV
toronto_users.to_csv('toronto/toronto_users.csv', index=False)

### Dataset stats
- Reviews: ~6 million
- Users: ~1.5 million
- Businesses: ~200,000

This project focuses on the restaurants in Toronto, as Toronto is a big city with more than enough data for a serious analysis, yet small enough for various graph algorithms to be run on it. The users considered in this project are all the users who left a review on a restaurant in Toronto.

- Period: March 1st 2008 to August 1st 2018
- Reviews: ~380,000
- Users: ~85,000
- Of which Elite users: ~7,500
- Restaurants: ~10,000

## Tools, theory and analysis. Describe the process of theory to insight
### Talk about how you've worked with text, including regular expressions, unicode, etc.
### Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.
#### Modelling the network
The Toronto Yelp review network was modelled as an undirected graph in which the nodes are users and an edge between two users indicates that both have reviewed the same restaurant.
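The construction described above can be sketched as follows. This is a minimal illustration using only the standard library, not necessarily the code used for the project; the function name `build_user_graph`, the adjacency-set representation, and the toy IDs are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(reviews):
    """Build an undirected user graph from (user_id, business_id) pairs.

    Returns a dict mapping each user to the set of users who reviewed
    at least one of the same restaurants.
    """
    # Group the reviewers by restaurant
    reviewers = defaultdict(set)
    for user_id, business_id in reviews:
        reviewers[business_id].add(user_id)

    graph = defaultdict(set)
    for users in reviewers.values():
        for user in users:
            graph[user]  # Ensure users without any neighbour still appear as nodes
        # Connect every pair of users who reviewed this restaurant
        for u, v in combinations(sorted(users), 2):
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)


# Toy example with hypothetical IDs
toy_reviews = [("u1", "r1"), ("u2", "r1"), ("u2", "r2"), ("u3", "r2"), ("u4", "r3")]
toy_graph = build_user_graph(toy_reviews)
```

Note that a restaurant with k reviewers contributes on the order of k² edges, so popular restaurants dominate the size of the graph.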

#### Largest connected component
The importance of the Elite users for the network was assessed by deleting them from the graph one by one, in order of their degree centrality, and observing how the largest connected component shrinks.
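The removal experiment can be sketched as follows. This is a minimal illustration under stated assumptions rather than the project's actual code: the graph is taken to be a dict of adjacency sets, users are removed in order of decreasing degree, and the ranking is computed once on the full graph instead of being updated after every removal. The function names are our own.

```python
from collections import deque


def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component, ignoring the nodes in `removed`."""
    seen = set(removed)  # Removed nodes are treated as already visited
    best = 0
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def removal_curve(graph, targets):
    """Remove `targets` one by one and record the largest component size each time."""
    # Degree centrality is degree / (n - 1), so ranking by degree gives the same order
    order = sorted(targets, key=lambda user: len(graph[user]), reverse=True)
    removed = set()
    sizes = [largest_component_size(graph)]
    for user in order:
        removed.add(user)
        sizes.append(largest_component_size(graph, removed))
    return order, sizes


# Toy example with hypothetical users, where "c" and "d" play the role of Elite users
toy_graph = {"a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"}, "d": {"c", "e"}, "e": {"d"}}
order, sizes = removal_curve(toy_graph, ["d", "c"])
```

Recomputing the component sizes from scratch after every removal is simple but slow for thousands of removals; evaluating the curve only every few removals is one way to reduce the cost.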
### How did you use the tools to understand your dataset?



## Discussion. Think critically about your creation
### What went well?
### What is still missing? What could be improved?