<a href="https://colab.research.google.com/github/DimaKav/DS-Sprint-01-Dealing-With-Data/blob/master/module1-afirstlookatdata/LS_DS_111_A_First_Look_at_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science - A First Look at Data



## Lecture - let's explore Python DS libraries and examples!

The Python Data Science ecosystem is huge. You've seen some of the big pieces - pandas, scikit-learn, matplotlib. What parts do you want to see more of?

I would like to see more data visualization.

In [20]:
# TODO - we'll be doing this live, taking requests
# and reproducing what it is to look up and learn things

# First five things you can do with the number 1
dir(1)[0:5]

['__abs__', '__add__', '__and__', '__bool__', '__ceil__']

## Assignment - now it's your turn

Pick at least one Python DS library, and using documentation/examples reproduce in this notebook something cool. It's OK if you don't fully understand it or get it 100% working, but do put in effort and look things up.

### Assignment questions

After you've worked on some code, answer the following questions in this text block:

1.  Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.

2.  What was the most challenging part of what you did?

3.  What was the most interesting thing you learned?

4.  What area would you like to explore with more time?




In [0]:
# TODO - your code here
# Use what we did live in lecture as an example

In [0]:
# Use yelp data to predict a restaurant's yelp rating


# Load data
import pandas as pd

businesses = pd.read_json('yelp_business.json', lines=True)
reviews = pd.read_json('yelp_review.json', lines=True)
users = pd.read_json('yelp_user.json', lines=True)
checkins = pd.read_json('yelp_checkin.json', lines=True)
tips = pd.read_json('yelp_tip.json', lines=True)
photos = pd.read_json('yelp_photo.json', lines=True)

In [0]:
# Combine data with respect to the target variable, rating
df = pd.merge(businesses, reviews, how='left', on='business_id')
df = pd.merge(df, users, how='left', on='business_id')
df = pd.merge(df, checkins, how='left', on='business_id')
df = pd.merge(df, tips, how='left', on='business_id')
df = pd.merge(df, photos, how='left', on='business_id')
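
To make the effect of `how='left'` concrete, here is a minimal sketch on two toy tables. The frames `toy_businesses` and `toy_reviews` and their values are made up for illustration and are not part of the Yelp data; `validate="one_to_one"` is an optional pandas check, added here as an assumption that each `business_id` appears once per table.

```python
import pandas as pd

# Toy stand-ins for two of the Yelp tables (values are made up for illustration)
toy_businesses = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "stars": [4.0, 3.5, 5.0],
})
toy_reviews = pd.DataFrame({
    "business_id": ["a", "b"],
    "average_review_sentiment": [0.6, 0.2],
})

# how="left" keeps every row of the left table; businesses with no match get NaN.
# validate="one_to_one" raises an error if a business_id is duplicated on either side.
toy_merged = pd.merge(
    toy_businesses,
    toy_reviews,
    how="left",
    on="business_id",
    validate="one_to_one",
)
print(toy_merged)
```

Business `c` has no row in `toy_reviews`, so its sentiment is missing after the merge. This is where the NaN values that get filled later come from.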

In [0]:
# Clean the data to get rid of the non-predictive features
features_to_remove = ['address','attributes','business_id','categories','city','hours','is_open','latitude','longitude','name','neighborhood','postal_code','state','time']
df.drop(features_to_remove, axis=1, inplace=True)

In [46]:
# Fill nans with 0
df.fillna(0, inplace=True)
print(df.isna().any())

alcohol?                      False
good_for_kids                 False
has_bike_parking              False
has_wifi                      False
price_range                   False
review_count                  False
stars                         False
take_reservations             False
takes_credit_cards            False
average_review_age            False
average_review_length         False
average_review_sentiment      False
number_cool_votes             False
number_funny_votes            False
number_useful_votes           False
average_days_on_yelp          False
average_number_fans           False
average_number_friends        False
average_number_years_elite    False
average_review_count          False
weekday_checkins              False
weekend_checkins              False
average_tip_length            False
number_tips                   False
average_caption_length        False
number_pics                   False
dtype: bool
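
A small sketch of what `fillna(0)` does, using a made-up single-column frame (`toy` is hypothetical, not the real data). It shows that a filled zero is treated as a real measurement afterwards, which shifts summary statistics such as the mean.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
toy = pd.DataFrame({"average_review_sentiment": [0.6, 0.2, np.nan]})

# mean() skips NaN by default, so only the two known values are averaged
mean_before = toy["average_review_sentiment"].mean()

# After filling, the former NaN counts as a genuine 0
filled = toy.fillna(0)
mean_after = filled["average_review_sentiment"].mean()

print(mean_before, mean_after)
print(filled.isna().any())
```

Whether 0 is a sensible stand-in depends on the column: it is natural for counts such as `number_tips`, and less obviously so for averages.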


In [48]:
# See which features are most correlated to rating
df.corr().stars.sort_values()

average_review_length        -0.277081
average_review_age           -0.125645
average_review_count         -0.066572
average_number_years_elite   -0.064419
average_tip_length           -0.052899
price_range                  -0.052565
alcohol?                     -0.043332
has_wifi                     -0.039857
average_days_on_yelp         -0.038061
average_number_fans          -0.031141
good_for_kids                -0.030382
take_reservations            -0.024486
average_number_friends       -0.007629
number_useful_votes          -0.000066
average_caption_length        0.000040
number_funny_votes            0.001320
number_pics                   0.001727
weekday_checkins              0.004130
weekend_checkins              0.007863
number_tips                   0.014038
review_count                  0.032413
takes_credit_cards            0.037748
number_cool_votes             0.043375
has_bike_parking              0.068084
average_review_sentiment      0.782187
stars                    

Average review sentiment (0.78) is by far the most strongly correlated with rating, followed by average review length (-0.28) and average review age (-0.13).
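
Because `sort_values()` orders by signed value, strong negative and strong positive correlations end up at opposite ends of the list. One possible way to rank by strength instead is to sort on the absolute value. The sketch below uses a handful of the correlations printed above, copied by hand into a hypothetical Series named `corr_with_stars`.

```python
import pandas as pd

# A few of the correlations with `stars` from the output above, copied by hand
corr_with_stars = pd.Series({
    "average_review_length": -0.277081,
    "average_review_age": -0.125645,
    "has_bike_parking": 0.068084,
    "average_review_sentiment": 0.782187,
})

# Order the labels by magnitude, then reindex so the signed values are kept
order = corr_with_stars.abs().sort_values(ascending=False).index
ranked = corr_with_stars.reindex(order)
print(ranked)
```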

In [0]:
# Define data that will potentially achieve an accurate model

# subset of only average review sentiment
sentiment = ['average_review_sentiment']
# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']
# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length','average_review_sentiment','number_funny_votes','number_cool_votes','number_useful_votes','average_tip_length','number_tips','average_number_friends','average_days_on_yelp','average_number_fans','average_review_count','average_number_years_elite','weekday_checkins','weekend_checkins']
# all features
all_features = binary_features + numeric_features

In [0]:
# Create a linear regression model with this function
import numpy as np

# take a list of features to model as a parameter
def model_these_features(feature_list):
    
    # Set the target and the features you want to use
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]
    
    # Split the data into 80/20 train/test
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
    
    # don't worry too much about these lines, just know that they allow the model to work when
    # we model on just one feature instead of multiple features. Trust us on this one :)
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)
    
    # Create linear regression model and fit it to the training data
    model = LinearRegression()
    model.fit(X_train,y_train)
    
    # R^2 is the coefficient of determination; it measures how much of the variance in our dependent
    # variable is explained by the independent variables
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))
    
    # print the model features and their corresponding coefficients, from most predictive to least predictive
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))
    
    # Use the model's coefficients to calculate what you want to predict (Yelp rating)
    y_predicted = model.predict(X_test)
    
    # Plots
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()

In [0]:
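# Make some imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# plt is never imported in the original notebook; the VBox(Figure(...)) outputs below
# suggest bqplot's pyplot interface, so that is assumed here
from bqplot import pyplot as plt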
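
The `model.score` calls in the function above return R^2. As a rough illustration of what that number means, the sketch below computes R^2 by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `score`. The data is synthetic and the names `toy_model`, `x`, and `y` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for illustration: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=200)

toy_model = LinearRegression().fit(x, y)
y_hat = toy_model.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

# The two numbers should agree up to floating-point error
print(r2_by_hand, toy_model.score(x, y))
```

An R^2 of 0 means the model does no better than always predicting the mean rating; an R^2 of 1 means the predictions match the data exactly.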

In [55]:
# Run the function to see how changing the feature sets affects R^2 value of our model

model_these_features(sentiment)

Train Score: 0.6118980950438655
Test Score: 0.6114021046919492
[('average_review_sentiment', 2.303390843374967)]


[Figure: scatter plot of Predicted Yelp Rating (y-axis) against Yelp Rating (x-axis)]
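
One way to read the single coefficient printed above: in a linear model, the predicted change in the target is the coefficient times the change in the feature. The sketch below applies that to a hypothetical 0.1 increase in average review sentiment; the coefficient is copied from the output, and the size of the increase is an arbitrary choice for illustration.

```python
# Coefficient copied from the single-feature model output above
sentiment_coef = 2.303390843374967

# Hypothetical change in average review sentiment (chosen for illustration)
sentiment_increase = 0.1

# Linear model: change in prediction = coefficient * change in feature
predicted_rating_change = sentiment_coef * sentiment_increase
print(round(predicted_rating_change, 2))
```

So a 0.1 rise in average sentiment corresponds to a predicted rating about 0.23 stars higher under this model.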

In [56]:
model_these_features(binary_features)

Train Score: 0.012223180709591164
Test Score: 0.010119542202269072
[('has_bike_parking', 0.1900300820804082), ('alcohol?', -0.14549670708138862), ('has_wifi', -0.13187397577762405), ('good_for_kids', -0.08632485990337223), ('takes_credit_cards', 0.07175536492195503), ('take_reservations', 0.04526558530451638)]


[Figure: scatter plot of Predicted Yelp Rating (y-axis) against Yelp Rating (x-axis)]

In [57]:
model_these_features(numeric_features)

Train Score: 0.6734992593766658
Test Score: 0.6713318798120134
[('average_review_sentiment', 2.272107664209677), ('price_range', -0.08046080962701445), ('average_number_years_elite', -0.07190366288054284), ('average_caption_length', -0.0033470660077835218), ('number_pics', -0.00295650281289204), ('number_tips', -0.0015953050789030726), ('number_cool_votes', 0.0011468839227092916), ('average_number_fans', 0.0010510602097447742), ('average_review_length', -0.0005813655692094734), ('average_tip_length', -0.0005322032063457423), ('number_useful_votes', -0.00023203784758712564), ('average_review_count', -0.0002243170289508221), ('average_review_age', -0.00016930608165089726), ('average_days_on_yelp', 0.00012878025876724556), ('weekday_checkins', 5.91858075446039e-05), ('weekend_checkins', -5.518176206974201e-05), ('average_number_friends', 4.826992111583864e-05), ('review_count', -3.483483763867104e-05), ('number_funny_votes', -7.884395674567053e-06)]


[Figure: scatter plot of Predicted Yelp Rating (y-axis) against Yelp Rating (x-axis)]

In [58]:
model_these_features(all_features)

Train Score: 0.6807828861895335
Test Score: 0.6782129045869245
[('average_review_sentiment', 2.2808456996623683), ('alcohol?', -0.1499149859346954), ('has_wifi', -0.12155382629262958), ('good_for_kids', -0.11807814422012454), ('price_range', -0.06486730150041427), ('average_number_years_elite', -0.06278939713895423), ('has_bike_parking', 0.027296969912258707), ('takes_credit_cards', 0.024451837853625796), ('take_reservations', 0.014134559172965556), ('number_pics', -0.0013133612300810522), ('average_number_fans', 0.001026798682265563), ('number_cool_votes', 0.0009723722734413303), ('number_tips', -0.0008546563320881045), ('average_caption_length', -0.0006472749798193465), ('average_review_length', -0.0005896257920272468), ('average_tip_length', -0.0004205217503405806), ('number_useful_votes', -0.0002715064125617315), ('average_review_count', -0.00023398356902508714), ('average_review_age', -0.00015776544111326633), ('average_days_on_yelp', 0.00012326147662885568), ('review_count', 0.00

[Figure: scatter plot of Predicted Yelp Rating (y-axis) against Yelp Rating (x-axis)]

In [59]:
# Let's grab all the features and retrain outside the function, so the fitted model is available for predictions below

features = df.loc[:,all_features]
ratings = df.loc[:,'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [60]:
# Let's check the range of values for each feature so we can use them to make a predictive model

stats = features.describe()  # compute the summary once and reuse it
pd.DataFrame(list(zip(features.columns, stats.loc['mean'], stats.loc['min'], stats.loc['max'])), columns=['Feature','Mean','Min','Max'])

|   | Feature | Mean | Min | Max |
|---|---|---|---|---|
| 0 | alcohol? | 0.14061 | 0.0 | 1.0 |
| 1 | has_bike_parking | 0.350692 | 0.0 | 1.0 |
| 2 | takes_credit_cards | 0.700243 | 0.0 | 1.0 |
| 3 | good_for_kids | 0.279029 | 0.0 | 1.0 |
| 4 | take_reservations | 0.106086 | 0.0 | 1.0 |
| 5 | has_wifi | 0.134968 | 0.0 | 1.0 |
| 6 | review_count | 31.79731 | 3.0 | 7968.0 |
| 7 | price_range | 1.035855 | 0.0 | 4.0 |
| 8 | average_caption_length | 2.831829 | 0.0 | 140.0 |
| 9 | number_pics | 1.489939 | 0.0 | 1150.0 |
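
A similar range table can also be built by selecting rows of `describe()` and transposing, which avoids zipping columns together by hand. This is a sketch on a made-up frame (`toy_features` is hypothetical); unlike the table above, the feature names end up in the index rather than in a `Feature` column.

```python
import pandas as pd

# Made-up feature columns for illustration
toy_features = pd.DataFrame({
    "has_wifi": [0, 1, 0, 1],
    "review_count": [3, 10, 25, 42],
})

# describe() has one row per statistic; keep three of them and flip to one row per feature
summary = toy_features.describe().loc[["mean", "min", "max"]].T
summary.columns = ["Mean", "Min", "Max"]
print(summary)
```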


In [0]:
# Feature values for My Awesome Restaurant, in the same order as all_features
my_awesome_restaurant = np.array([0,1,1,1,1,1,10,2,3,10,10,1200,0.9,3,6,5,50,3,50,1800,12,123,0.5,0,0]).reshape(1,-1)

In [62]:
# Predict the yelp rating based on the features of my awesome restaurant
model.predict(my_awesome_restaurant)

array([4.03799004])
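
The positional array above only works if its 25 values line up exactly with `all_features`. A less error-prone sketch is to key the values by feature name and let the feature list set the order. The dictionary `my_restaurant` below is a hypothetical helper holding the same values as the array; the two feature lists are copied from earlier in the notebook so the block runs on its own.

```python
import numpy as np

# Feature lists copied from the notebook so this block is self-contained
binary_features = ['alcohol?', 'has_bike_parking', 'takes_credit_cards',
                   'good_for_kids', 'take_reservations', 'has_wifi']
numeric_features = ['review_count', 'price_range', 'average_caption_length',
                    'number_pics', 'average_review_age', 'average_review_length',
                    'average_review_sentiment', 'number_funny_votes',
                    'number_cool_votes', 'number_useful_votes', 'average_tip_length',
                    'number_tips', 'average_number_friends', 'average_days_on_yelp',
                    'average_number_fans', 'average_review_count',
                    'average_number_years_elite', 'weekday_checkins', 'weekend_checkins']
all_features = binary_features + numeric_features

# Same values as the positional array, keyed by feature name
my_restaurant = {
    'alcohol?': 0, 'has_bike_parking': 1, 'takes_credit_cards': 1,
    'good_for_kids': 1, 'take_reservations': 1, 'has_wifi': 1,
    'review_count': 10, 'price_range': 2, 'average_caption_length': 3,
    'number_pics': 10, 'average_review_age': 10, 'average_review_length': 1200,
    'average_review_sentiment': 0.9, 'number_funny_votes': 3,
    'number_cool_votes': 6, 'number_useful_votes': 5, 'average_tip_length': 50,
    'number_tips': 3, 'average_number_friends': 50, 'average_days_on_yelp': 1800,
    'average_number_fans': 12, 'average_review_count': 123,
    'average_number_years_elite': 0.5, 'weekday_checkins': 0, 'weekend_checkins': 0,
}

# Fail early if a feature was forgotten
missing = set(all_features) - set(my_restaurant)
if missing:
    raise ValueError(f"Missing feature values: {sorted(missing)}")

# Build a single row in the order the model was trained on
row = np.array([my_restaurant[name] for name in all_features], dtype=float).reshape(1, -1)
print(row.shape)
```

With the fitted `model` from the notebook, `model.predict(row)` would then take the place of `model.predict(my_awesome_restaurant)`.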

**Describe in a paragraph of text what you did and why, as if you were writing an email to somebody interested but nontechnical.**

The success of a restaurant is closely tied to its online reputation. Since Yelp is one of the most widely used review sites, I built a model from Yelp data that predicts the Yelp rating of My Awesome Restaurant. At the end of this notebook, we can change the value of each feature and see how it could affect the overall rating. Note: the current model's R^2 on the test set is about 0.68, meaning it explains roughly 68% of the variation in ratings.

**What was the most challenging part of what you did?**

Figuring out the best model and hyperparameters. Also figuring out what the code actually does.

**What was the most interesting thing you learned?**

The `.corr` method in pandas is very useful.

**What area would you like to explore with more time?**

Different models. Neural networks are cool and interesting, and I would like to learn a lot more about them.

## Stretch goals and resources

Following are *optional* things for you to take a look at. Focus on the above assignment first, and make sure to commit and push your changes to GitHub (and since this is the first assignment of the sprint, open a PR as well).

- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [scikit-learn documentation](http://scikit-learn.org/stable/documentation.html)
- [matplotlib documentation](https://matplotlib.org/contents.html)
- [Awesome Data Science](https://github.com/bulutyazilim/awesome-datascience) - a list of many types of DS resources

Stretch goals:

- Find and read blogs, walkthroughs, and other examples of people working through cool things with data science - and share with your classmates!
- Write a blog post (Medium is a popular place to publish) introducing yourself as somebody learning data science, and talking about what you've learned already and what you're excited to learn more about