# This notebook is for selecting domains O and D

## First, we analyse what data we have *-Caution: Python 2.7 is used!-*
- In this step, we load the businesses CSV file to see which categories exist in the YELP R10 data
    - After some time, I discovered that the `categories` column isn't a list of lists, but a list of **strings**, each one holding the text of a list! -_-
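
To make the string-versus-list point concrete, here is a minimal sketch with two made-up rows (the sample values are illustrative, not taken from the dataset). It uses only the standard library and should run on both Python 2.7 and Python 3.

```python
import ast
from collections import Counter

# Hypothetical sample rows: each value is the *text* of a list, as read from the CSV.
raw_rows = ["['Food', 'Coffee & Tea']", "['Sandwiches', 'Restaurants', 'Food']"]

# A raw cell is a plain string, so iterating over it would yield single characters.
assert isinstance(raw_rows[0], str)

# ast.literal_eval safely evaluates the literal and returns a real list.
parsed_rows = [ast.literal_eval(row) for row in raw_rows]

# Count how often each category occurs across all rows.
category_counts = Counter(cat for row in parsed_rows for cat in row)
```

`Counter` plays the same role here as the `dict.get(cat, 0) + 1` idiom used in the cell that follows.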

In [1]:
import numpy as np;
import pandas as pd;
import ast; #import the abstract syntax trees library, to evaluate the string of multiple categories
#from ggplot import *; #https://github.com/yhat/ggpy for documentation
import matplotlib.pyplot as plt;

businessesFilePath = "/media/sog/Data/College/DKEM/3_WS_17_18/Team Project YELP SA/dataset/business.csv";
dfBusinesses = pd.read_csv(businessesFilePath,
                          usecols=['business_id','categories'],
                          engine='python');

#Take a look at what we have in there:
#print dfBusinesses.head();

#Deceptively enough, categories is of type string, not list, so we need to convert each string to a list
dictCategories = {};
for categoryRecord in dfBusinesses["categories"]:
    lstCategories = ast.literal_eval(categoryRecord);
    for cat in lstCategories:
        dictCategories[cat] = dictCategories.get(cat,0) + 1;
        
#Let's have a look at 10 elements:
print {k: dictCategories[k] for k in sorted(dictCategories.keys())[:10]};

{u'Addiction Medicine': 7, u'& Probates': 37, u'Acai Bowls': 37, u'3D Printing': 5, u'Accessories': 1600, u'Active Life': 7427, u'Accountants': 214, u'Acupuncture': 503, u'Acne Treatment': 11, u'ATV Rentals/Tours': 40}


- We need to perform a group count for the categories to get insight into the business tallies. Thank God we don't have arbitrary nesting, otherwise we would need to flatten the lists first

In [2]:
#Convert the dictionary to a DataFrame
dfCategories = pd.DataFrame.from_dict(dictCategories, orient="index");
dfCategories.columns = ["Frequency"];
dfCategories = dfCategories.sort_values(by="Frequency",ascending=False);

print dfCategories.head(10);

                  Frequency
Restaurants           51613
Shopping              24595
Food                  23014
Beauty & Spas         15139
Home Services         13202
Health & Medical      12033
Nightlife             11364
Bars                   9868
Automotive             9476
Local Services         9343


- Let's calculate each category's share of the total number of businesses, keeping in mind that one entity may be assigned to more than 1 category, so it's not a 1-1 mapping and the shares don't add up to 1
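
Because one business can carry several categories, each share is taken against the number of businesses rather than the number of labels, and the shares overlap. A small sketch with made-up data (the three businesses below are hypothetical):

```python
from collections import Counter

# Hypothetical, already-parsed category lists for three businesses.
businesses = [
    ["Restaurants", "Food"],
    ["Restaurants", "Bars", "Nightlife"],
    ["Shopping"],
]

counts = Counter(cat for cats in businesses for cat in cats)

# Share of businesses carrying each category (float() avoids Python 2 integer division).
shares = {cat: n / float(len(businesses)) for cat, n in counts.items()}

# The shares overlap, so their sum exceeds 1.
total_share = sum(shares.values())
```

Here six labels are spread over three businesses, so the shares sum to 2 rather than 1.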

In [6]:
#Add the share of businesses that carry each category (a fraction between 0 and 1, despite the _pct suffix)
print "Total number of business entities: " + str(dfBusinesses.shape[0]) + "\n";
dfCategories["Category_Total_pct"] = dfCategories["Frequency"] / dfBusinesses.shape[0];
#List the top 1% of categories by number of occurrences
print dfCategories.head(int(0.01 * dfCategories.shape[0]));

Total number of business entities: 156639

                           Frequency  Category_Total_pct
Restaurants                    51613            0.329503
Shopping                       24595            0.157017
Food                           23014            0.146924
Beauty & Spas                  15139            0.096649
Home Services                  13202            0.084283
Health & Medical               12033            0.076820
Nightlife                      11364            0.072549
Bars                            9868            0.062998
Automotive                      9476            0.060496
Local Services                  9343            0.059647
Event Planning & Services       8038            0.051315
Active Life                     7427            0.047415


- As a first rule for picking categories, we exclude the ones that describe less than 5% of all entities, assuming that their dataset sizes are too small to make them candidates for domain O, or even D:

In [4]:
#Picking the ones whose pct is >= 5%
dfCandidateCategories = dfCategories[dfCategories["Category_Total_pct"] >= 0.0500];
print dfCandidateCategories;

                           Frequency  Category_Total_pct
Restaurants                    51613            0.329503
Shopping                       24595            0.157017
Food                           23014            0.146924
Beauty & Spas                  15139            0.096649
Home Services                  13202            0.084283
Health & Medical               12033            0.076820
Nightlife                      11364            0.072549
Bars                            9868            0.062998
Automotive                      9476            0.060496
Local Services                  9343            0.059647
Event Planning & Services       8038            0.051315


- We now have to *merge* the business entities that belong to one of these categories with the reviews, in order to analyse the number of reviews as a crucial factor in choosing our domains
    - Since the categories are stored as strings, we have to build a unicode filter (a regex alternation of the category names), as follows (more on this [Here](https://stackoverflow.com/questions/11350770/pandas-dataframe-select-by-partial-string))
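
A minimal sketch of how such a partial-string filter behaves. pandas' `str.contains` treats its argument as a regular expression by default, so the sketch uses `re.search` on made-up strings to stay self-contained. The helper name `build_pattern` and the sample rows are assumptions for illustration only.

```python
import re

def build_pattern(categories):
    """Join category names into one regex alternation, escaping metacharacters."""
    return "|".join(re.escape(cat) for cat in categories)

candidates = ["Restaurants", "Beauty & Spas", "Home Services"]
pattern = build_pattern(candidates)

# Made-up category strings in the same shape as the CSV column.
rows = [
    "['Sandwiches', 'Restaurants']",
    "['Professional Services', 'Matchmakers']",
]

matches = [bool(re.search(pattern, row)) for row in rows]

# For comparison: a pattern with a leading "|" has an empty alternative,
# and the empty pattern matches every string.
leaky_matches = [bool(re.search("|" + pattern, row)) for row in rows]
```

With `"|".join(...)` only the first row matches. With a leading `|`, both rows match, whatever their categories are.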

In [5]:
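#Build the filter from our candidate categories.
#Caution: the loop below prefixes every category with "|", so the pattern starts with "|";
#that leading empty alternative matches any string, i.e. str.contains() keeps every row.
unicodeFilter = "";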


for cat in dfCandidateCategories.index:
    unicodeFilter += "|" + str(cat);
    
    
print str(unicodeFilter) + "\n\n" + str(dfCandidateCategories.shape[0]) \
+ " business categories to be considered:\n\n---------------\n\n"


dfCandidateBusinessesIds = dfBusinesses.loc[dfBusinesses["categories"].str.contains(unicodeFilter)];


print "After filtering, " + str(dfCandidateBusinessesIds.shape[0]) + " out of " \
+ str(dfBusinesses.shape[0]) + " " \
"business entities left ("+ str(100*dfCandidateBusinessesIds.shape[0]/dfBusinesses.shape[0]) +"%).\n";


print dfCandidateBusinessesIds.head();

|Restaurants|Shopping|Food|Beauty & Spas|Home Services|Health & Medical|Nightlife|Bars|Automotive|Local Services|Event Planning & Services

11 business categories to be considered:

---------------


After filtering, 156639 out of 156639 business entities left (100%).

              business_id                                         categories
0  YDf95gJZaq05wvo7hTQbbQ                 [u'Shopping', u'Shopping Centers']
1  mLwM-h2YhXl2NCgdS84_Bw  [u'Food', u'Soul Food', u'Convenience Stores',...
2  v2WhjAB3PIBA8J8VxG3wEg                         [u'Food', u'Coffee & Tea']
3  CVtCbSB1zUcUWg-9TNGTuQ         [u'Professional Services', u'Matchmakers']
4  duHFBe87uNSXImQmvBh87Q                    [u'Sandwiches', u'Restaurants']


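- It appears that the candidate categories cover the **whole dataset**, but this 100% is an artifact of the filter: the pattern starts with `|`, and that empty alternative matches every row (row 3 above, `[u'Professional Services', u'Matchmakers']`, contains none of the 11 categories). The real coverage has to be recomputed with a pattern that has no leading `|`.
    - Either way, let's split the businesses data to extract merge vectors for each category, in order to get the relevant number of reviews later on
        - Splitting was done in pure code rather than in the notebook.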
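
The splitting code itself is not part of this notebook. As a rough sketch of what a per-category split could look like, the following groups business ids by exact category membership after parsing each string. The function name `split_by_category` and the sample rows are assumptions, not the project's actual code.

```python
import ast
from collections import defaultdict

def split_by_category(rows, candidates):
    """Map each candidate category to the ids of businesses that list it exactly."""
    wanted = set(candidates)
    ids_by_category = defaultdict(list)
    for business_id, raw_categories in rows:
        # Parse the stringified list, then keep only the candidate categories.
        for cat in set(ast.literal_eval(raw_categories)) & wanted:
            ids_by_category[cat].append(business_id)
    return dict(ids_by_category)

# Made-up (business_id, categories) pairs in the shape of the CSV.
rows = [
    ("b1", "['Food', 'Soul Food']"),
    ("b2", "['Soul Food']"),
    ("b3", "['Sandwiches', 'Restaurants']"),
]
split = split_by_category(rows, ["Food", "Restaurants"])
```

Exact membership keeps `Soul Food` from being counted as `Food`, which a substring match on the raw string would do.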