# Restaurant Review Prediction Project Using Machine Learning

The restaurant industry is tougher than ever, with reviews on the Internet from day one of opening a restaurant. But as a food lover, you and a friend decide to enter the industry and open your own restaurant. Since a restaurant's success is highly correlated with its reputation, you want to make sure it has the best reviews on the most consulted restaurant rating and review site: <a href="https://www.yelp.com/" target="blank">Yelp!</a>.

While you know your food will be delicious, you believe other factors also influence the Yelp rating and will ultimately determine the success of your business. With a dataset of different restaurant characteristics and their Yelp stars, you decide to use a Multiple Linear Regression model to investigate which factors most affect a restaurant's rating and to predict the number of Yelp stars for your restaurant.

In this project we will work with a real dataset provided by Yelp. It is a large amount of data; the idea of this challenge is to simulate a real project environment.

We have provided six files, listed below with a brief description:

* `yelp_business.json`: establishment data covering the location and attributes of every business in the dataset.
* `yelp_review.json`: review metadata per business.
* `yelp_user.json`: user profile metadata per business.
* `yelp_checkin.json`: online check-in metadata per business.
* `yelp_tip.json`: tip metadata per business.
* `yelp_photo.json`: photo metadata per business.

There are several sources for Yelp data. Yelp hosts the data [here](https://www.yelp.com/dataset). Kaggle also provides a [dataset](https://www.kaggle.com/yelp-dataset/yelp-dataset). I am not sure what the differences between the two sources are, if any.

I decided to use the Kaggle data in the exercise below.

## Load the data and take a look at it

To get a better understanding of the dataset, we can use Pandas to explore the data as DataFrames. In the following code block you must import Pandas. The `read_json()` function reads the data from a JSON file into a DataFrame, as shown below:

`df = pd.read_json('file_name.json', lines=True)`
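
As a small illustration of what `lines=True` expects, the sketch below reads three made-up records in JSON Lines form (one JSON object per line) from an in-memory buffer. The record contents are invented for the example and are not taken from the Yelp files.

```python
import io

import pandas as pd

# Three made-up records, one JSON object per line (JSON Lines format).
raw = (
    '{"business_id": "b1", "stars": 4.5, "review_count": 13}\n'
    '{"business_id": "b2", "stars": 4.0, "review_count": 126}\n'
    '{"business_id": "b3", "stars": 3.0, "review_count": 8}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample = pd.read_json(io.StringIO(raw), lines=True)
print(sample.shape)
```

With a real file, the path to the `.json` file would take the place of the `io.StringIO` buffer.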

Load the data from each of the json files with the following naming conventions:

* `yelp_business.json` in a DataFrame named `business`
* `yelp_review.json` in a DataFrame named `review`
* `yelp_user.json` in a DataFrame named `user`
* `yelp_checkin.json` in a DataFrame named `checkin`
* `yelp_tip.json` in a DataFrame named `tip`
* `yelp_photo.json` in a DataFrame named `photos`

Import all the datasets as indicated above

In [5]:
# Reading the files directly with read_json fails: the review file is too large to fit in memory.
# import pandas as pd
# business = pd.read_json('../../LargeData/yelp/yelp_academic_dataset_business.json', lines=True)
# reviews = pd.read_json('../../LargeData/yelp/yelp_academic_dataset_review.json', lines=True)
# users = pd.read_json('../../LargeData/yelp/yelp_academic_dataset_user.json', lines=True)
# checkins = pd.read_json('../../LargeData/yelp/yelp_academic_dataset_checkin.json', lines=True)
# tips = pd.read_json('../../LargeData/yelp/yelp_academic_dataset_tip.json', lines=True)
# photos = pd.read_json('../../LargeData/yelp/photos.json', lines=True)
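
One possible way around the memory limit, sketched here under the assumption that only a few columns of the review file are needed, is to read the file in chunks with the `chunksize` argument of `read_json` and keep only those columns. The tiny in-memory file, the chosen columns, and the chunk size of 2 are stand-ins; a real run would pass the file path and a much larger chunk size. The notebook itself takes a different route and loads the data through MongoDB.

```python
import io

import pandas as pd

# Made-up review records standing in for the large review file.
raw = (
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "text": "long text"}\n'
    '{"review_id": "r2", "business_id": "b1", "stars": 4, "text": "long text"}\n'
    '{"review_id": "r3", "business_id": "b2", "stars": 1, "text": "long text"}\n'
    '{"review_id": "r4", "business_id": "b3", "stars": 3, "text": "long text"}\n'
    '{"review_id": "r5", "business_id": "b3", "stars": 5, "text": "long text"}\n'
)

keep = ["business_id", "stars"]  # hypothetical choice of columns to retain

# With chunksize set, read_json returns an iterator of DataFrames
# instead of loading everything at once.
reader = pd.read_json(io.StringIO(raw), lines=True, chunksize=2)

# Keep only the needed columns from each chunk, then stitch the pieces together.
parts = [chunk[keep] for chunk in reader]
reviews_small = pd.concat(parts, ignore_index=True)
print(len(reviews_small))
```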

In [6]:
import pymongo
import pdmongo as pdm
import pandas as pd

On the home PC, start MongoDB, then read from the `yelp` database and the collections within it.

In order to see more clearly the information in our DataFrame, we can adjust the number of columns displayed (`max_columns`) and the number of characters displayed in a column (`max_colwidth`) with the following code:

```
pd.options.display.max_columns = number_of_columns_to_display
pd.options.display.max_colwidth = number_of_characters_to_display
```

Set `max_columns` to `60` and `max_colwidth` to `500`. We are working with some BIG data here! (welcome to Big Data!)

In [7]:
pd.options.display.max_columns = 60
pd.options.display.max_colwidth = 500

In [8]:
df_business = pdm.read_mongo("business", [], "mongodb://localhost:27017/yelp")

In [9]:
df_business.head()

Unnamed: 0,_id,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}"
1,605ba08cbf4286a93a81ab53,tCbdrRPZA0oiIYSmHG3J0w,Flying Elephants at PDX,7000 NE Airport Way,Portland,OR,97218,45.588906,-122.593331,4.0,126,1,"{'RestaurantsTakeOut': 'True', 'RestaurantsAttire': 'u'casual'', 'GoodForKids': 'True', 'BikeParking': 'False', 'OutdoorSeating': 'False', 'Ambience': '{'romantic': False, 'intimate': False, 'touristy': False, 'hipster': False, 'divey': False, 'classy': False, 'trendy': False, 'upscale': False, 'casual': True}', 'Caters': 'True', 'RestaurantsReservations': 'False', 'RestaurantsDelivery': 'False', 'HasTV': 'False', 'RestaurantsGoodForGroups': 'False', 'BusinessAcceptsCreditCards': 'True', 'No...","Salad, Soup, Sandwiches, Delis, Restaurants, Cafes, Vegetarian","{'Monday': '5:0-18:0', 'Tuesday': '5:0-17:0', 'Wednesday': '5:0-18:0', 'Thursday': '5:0-18:0', 'Friday': '5:0-18:0', 'Saturday': '5:0-18:0', 'Sunday': '5:0-18:0'}"
2,605ba08cbf4286a93a81ab54,6iYb2HFDywm3zjuRg0shjw,Oskar Blues Taproom,921 Pearl St,Boulder,CO,80302,40.017544,-105.283348,4.0,86,1,"{'RestaurantsTableService': 'True', 'WiFi': 'u'free'', 'BikeParking': 'True', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}', 'BusinessAcceptsCreditCards': 'True', 'RestaurantsReservations': 'False', 'WheelchairAccessible': 'True', 'Caters': 'True', 'OutdoorSeating': 'True', 'RestaurantsGoodForGroups': 'True', 'HappyHour': 'True', 'BusinessAcceptsBitcoin': 'False', 'RestaurantsPriceRange2': '2', 'Ambience': '{'touristy': False, 'hipst...","Gastropubs, Food, Beer Gardens, Restaurants, Bars, American (Traditional), Beer Bar, Nightlife, Breweries","{'Monday': '11:0-23:0', 'Tuesday': '11:0-23:0', 'Wednesday': '11:0-23:0', 'Thursday': '11:0-23:0', 'Friday': '11:0-23:0', 'Saturday': '11:0-23:0', 'Sunday': '11:0-23:0'}"
3,605ba08cbf4286a93a81ab55,oaepsyvc0J17qwi8cfrOWg,Great Clips,2566 Enterprise Rd,Orange City,FL,32763,28.914482,-81.295979,3.0,8,1,"{'RestaurantsPriceRange2': '1', 'BusinessAcceptsCreditCards': 'True', 'GoodForKids': 'True', 'ByAppointmentOnly': 'False'}","Beauty & Spas, Hair Salons",
4,605ba08cbf4286a93a81ab56,PE9uqAjdw0E4-8mjGl3wVA,Crossfit Terminus,1046 Memorial Dr SE,Atlanta,GA,30316,33.747027,-84.353424,4.0,14,1,"{'GoodForKids': 'False', 'BusinessParking': '{'garage': False, 'street': False, 'validated': False, 'lot': False, 'valet': False}', 'BusinessAcceptsCreditCards': 'True'}","Gyms, Active Life, Interval Training Gyms, Fitness & Instruction","{'Monday': '16:0-19:0', 'Tuesday': '16:0-19:0', 'Wednesday': '16:0-19:0', 'Thursday': '16:0-19:0', 'Friday': '16:0-19:0', 'Saturday': '9:0-11:0'}"


In [10]:
df_review = pdm.read_mongo("review", [], "mongodb://localhost:27017/yelp")

In [11]:
df_review.head()

Unnamed: 0,_id,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,605ba0f1d6b0c7d7a785875b,9vqwvFCBG3FBiHGmOHMmiA,XGkAG92TQ3MQUKGX9sLUhw,DbXHNl890xSXNiyRczLWAg,5.0,0,0,0,"Probably one of the better breakfast sandwiches I've ever had. I had the EGGMEATMUFFIN, the bread was toasted perfectly and the bacon was a real thick cut. Not that lame bacon we are more familiar with at your conventional breakfast diner. In addition, the place was clean and the staff was very helpful. The butcher had several different cuts available and was knowledgable as well as friendly. I left with some cuts of pork and beef and am excited to come back!",2017-12-02 18:16:13
1,605ba0f1d6b0c7d7a785875c,lWC-xP3rd6obsecCYsGZRg,ak0TdVmGKo4pwqdJSTLwWw,buF9druCkbuXLX526sGELQ,4.0,3,1,1,Apparently Prides Osteria had a rough summer as evidenced by the almost empty dining room at 6:30 on a Friday night. However new blood in the kitchen seems to have revitalized the food from other customers recent visits. Waitstaff was warm but unobtrusive. By 8 pm or so when we left the bar was full and the dining room was much more lively than it had been. Perhaps Beverly residents prefer a later seating. \n\nAfter reading the mixed reviews of late I was a little tentative over our choice b...,2014-10-11 03:34:02
2,605ba0f1d6b0c7d7a785875d,KKVFopqzcVfcubIBxmIjVA,99RsBrARhhx60UnAC4yDoA,EEHhKSxUvJkoPSzeGKkpVg,5.0,0,0,0,"I work in the Pru and this is the most affordable and tasty place in the food court. deals where a meal is $5-$7 and the chicken pesto is really good. I am not a chowda person but all there soups I have had are pretty damn good. Broccoli chicken is my favorite. Also, probably the most personable Food court staff I have ever had the pleasure of ordering from.",2014-05-07 18:10:21
3,605ba0f1d6b0c7d7a785875e,2l_TDrQ7p-5tANOyiOlkLQ,LWUnzwK0ILquLLZcHHE1Mw,mD-A9KOWADXvfrZfwDs-jw,4.0,1,0,0,"I am definitely a fan of Sports Authority. This particular location has a good check in deal. We came here near Christmas time to buy some presents and we had a good experience. The staff members were very friendly and they helped us find what we were looking for. We got some golf stuff, two pairs of shoes, a tennis racket bag, and some bicycle accessories. The store was clean and well organized. They have everything from apparel to basketballs and everything in between. Good spot to ...",2012-05-28 15:00:47
4,605ba0f1d6b0c7d7a785875f,FdoBFTjXXMn4hVnJ59EtiQ,eLAYHxHUutiXswy-CfeiUw,WQFn1A7-UAA4JT5YWiop_w,1.0,0,0,0,"They NEVER seem to get our \norder correct, service is crappy, food is inconsistent and has gone down hill steadily in the last 6-9 months! WILL NEVER GO THERE AGAIN!",2017-09-08 23:26:10


Because the review data is so large, merge it with the business data right away, then remove the two source DataFrames from memory.

In [12]:
df_merged = pd.merge(df_business, df_review, how="left", on="business_id")
print(len(df_merged))

8635403
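
Both tables have a `stars` column (the business average and the individual review rating), so the merge renames them `stars_x` and `stars_y`. As a sketch on toy data, the `suffixes` argument of `pd.merge` can give the two columns more descriptive names. The suffix strings used here are only a suggestion.

```python
import pandas as pd

# Toy stand-ins for the business and review tables.
business = pd.DataFrame({"business_id": ["b1", "b2"], "stars": [4.5, 3.0]})
review = pd.DataFrame({"business_id": ["b1", "b1"], "stars": [4.0, 5.0]})

# Left join: every business is kept, even b2, which has no reviews.
merged = pd.merge(
    business,
    review,
    how="left",
    on="business_id",
    suffixes=("_business", "_review"),
)
print(merged.columns.tolist())
```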


In [13]:
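import gc

# Drop the references to the two source frames, then ask the garbage collector to reclaim the memory.
del df_business, df_review
gc.collect()
# Rebind the names to empty DataFrames so later cells that mention them do not raise NameError.
df_business = pd.DataFrame()
df_review = pd.DataFrame()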
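
Before deciding what to free, it can help to measure how much memory a DataFrame occupies. A minimal sketch on a toy frame, using `memory_usage(deep=True)` so that the contents of string columns are counted as well:

```python
import pandas as pd

# Toy frame; the real df_merged would be measured the same way.
toy = pd.DataFrame({"business_id": ["b1", "b2", "b1"], "stars": [4.0, 3.5, 5.0]})

# Bytes used per column (plus the index); deep=True inspects object columns.
bytes_per_column = toy.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024 ** 2
print(bytes_per_column)
print(f"total: {total_mb:.6f} MiB")
```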

In [14]:
df_merged.head()

Unnamed: 0,_id_x,business_id,name,address,city,state,postal_code,latitude,longitude,stars_x,review_count,is_open,attributes,categories,hours,_id_y,review_id,user_id,stars_y,useful,funny,cool,text,date
0,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f1d6b0c7d7a78599c6,xYcbW9MPyLdy8fwbIoAyAQ,i9F35UmYkvBirC2Yl9fuqg,4.0,1,0,0,"Quaint little store with tons of amazing items! We purchased a cabinet from them and it looks amazing in our house. They are very helpful and always adding new inventory. If you let them know you are looking for a certain item, they will FB message you when they have new stock coming it! I like that customer attention! Makes me a BIG fan of this store. Can't wait to find my next piece.",2016-08-25 16:52:19
1,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f2d6b0c7d7a7864c0e,ktrlT7JkK9QwlLdpF-sTVA,WADswAe3hbMWM0uUxb_rrQ,5.0,4,0,1,"Gasp! AMAZING!\nThey took the idea of restoring and modifying vintage furniture and did it flawlessly!\nThere are other places in town that have tried this and dropped the ball, but not here!\n\nThe craziest part is, the prices aren't nearly as high as they should be!!!\nConsider how good this stuff looks, the time that went into doing it, the item itself, you're getting a steal of a deal on every item in here!\n\nNow for the things in here, when I showed up I saw:\n1 couch\n3 chairs \n1 cra...",2014-10-12 16:50:27
2,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f3d6b0c7d7a787177a,9p-z0pVIkXPnoVncOKTAtg,SQ87CzgSrsY9H5ankTA6UQ,4.0,0,0,0,You know that shop that doesn't have alot but what it does have is truly lovely. Well this is that place. The selection is incredibly sparse but what is there is excellent.,2015-01-21 04:38:44
3,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f6d6b0c7d7a7891f36,RjktBj4W_T078NK60kaUVw,JUW2nLyVUxc_lGVFxmj-VQ,5.0,1,0,0,"I brought in a client and found several pieces for her home. Impeccable restoration and priced well, this store is an absolute gem!",2015-10-26 20:20:49
4,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f7d6b0c7d7a789fc6b,LwPbnhZBTB37j00ivD_WKw,nkNAiBFCe9se6wyRZWkZYA,5.0,0,0,0,just bought a piece of furniture here. I love the selection of mid century they have and the way they restore pieces is truly amazing. so glad I happened upon this place. will be returning shortly.,2015-06-04 02:27:58


In [15]:
df_user = pdm.read_mongo("user", [], "mongodb://localhost:27017/yelp")

In [16]:
df_user.head()

Unnamed: 0,_id,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,605b9fb937964792354af884,q_QQ5kBBwlCcbL1s4NVK3g,Jane,1220,2005-03-14 20:26:35,15038,10030,11291,200620072008200920102011201220132014,"xBDpTUbai0DXrvxCe3X16Q, 7GPNBO496aecrjJfW6UWtg, gUfHciSP7BbxZd5gj-c4xw, NXw0bCLF5ZtFMfhcj7CFSw, OGjmMxPuIoLTJ3O-CO2A4g, mwUcJP11UkIjCB8jBAaS3g, fDmgb3Vi3-f_MtFOImH0Ug, -x1516ZG5GllZiBjDQlRkA, tx5UcfGrsud-CQdq8p8KKw, HKooPGsHiZV_0vTn45fG_w, 2iSBJHVMNsolJ3AH1g_D7A, NcoP47QP_eMVtoZvinwU-w, XPOx-mCubVGQ1rRknztFtw, K6Tbv3a_qUQK0ed4T0_u5Q, _dUWTJf0faMXMdr_RFW5Xw, 5ni2bacPC7scIAHAb9DmlQ, cG-UHRz9QdhBEBz3R_8X0Q, BPh-OMqPul6HXsnCHxsk6g, mgzNtI5XOuPwwukp6yh1Vw, pMbWlP0cAvtRSFMI-8mvaw, yf-DODoyVwAh3OQY...",1357,3.85,1710,163,190,361,147,1212,5691,2541,2541,815,323
1,605b9fb937964792354af885,D6ErcUnFALnCQN4b1W_TlA,Jason,119,2007-02-07 15:47:53,188,128,130,20102011,"GfB6sC4NJQvSI2ewbQrDNA, jhZtzZNNZJOU2YSZ6jPlXQ, DJWvvie6YTka5ylcqMIXvg, b3Fm3LStrOaYQ49ZyLkeOw, YwjzTTCF9Jor4-44SVeE1w, 4Po_4vgdBFk2GPeKfJqVww, N9CrBkTkDZFqQoL64e5USA, Q75-fmX3WoLZEkuBQGxZCg, _vgQ8aaBUMLQpJfJ5m-dwg, 8_Hs9Bh5tgFdc_1ycxHChg, SB5VqHvarUXF6kyu_s3nsw, R8hKfsaQnHtGpidtjJyrAg, sRVGewx--HgwwlKwAfCO9Q, GeRP723KwUXL32S_h-OdCg, jjT8b-kahZB4e4Av_mjwpw, yJCie8LyYQzl3kLDhBbFZQ, TEtzbpgA2BFBrC0y0sCbfw, KETa97Knz4ZJZPv53CHGcg, ZkURDI6S5GkxP7vnvQ_PRg, XAsqtZHeqw7uoPXP7oQx7w, aUM3q3B_D-FLyTZj...",16,3.76,22,1,3,0,0,5,20,31,31,3,1
2,605b9fb937964792354af886,1jXmzuIFKxTnEnR0pxO0Hg,Clara,299,2010-10-01 17:29:36,381,106,121,201020112012201320142015,"VGfzq5na6LZUwxwWO5eVLA, 35uHDsVOEsWbLdEg8Ttobg, bBfyLClSrcHGI6awaIc4Pw, SHz7TX5J3um3DoYZQAm4CA, utccnoVeY4-l1w2vXv8R8g, gKYI3VG0PcFOAHD9PoAfOw, 3GxGGcfabW2GeDyBQkQx4w, FQw4V4hPvU6u6a2SxamyMg, SxVEA7yCAgIe-i-SsMbQkA, xW8kYyBUm5BHcgxlPERLmQ, xRBnIhqYTJZcEA4-wuZ4uA, IrBr1IBag-P9NKOzcFsSKA, QrEcHAmQz6p9JKO0Plz98w, SLwEOhwenXdvrD1zhJLAFQ, TJMgOTcm2ShyOpablSYbXA, zB-qLoBsLHW-n44p6mJF5w, iovWFbA1UtMGDJ4C8hbhYA, tUZLQX2vOealmUmFdyOHmQ, 8h1r7mXd-H225yDgc4EMPg, M79rv2KKQA4cYEoTIBXvoQ, 2Fu4xzc4Z4SvRGDc...",23,3.43,17,8,2,6,0,17,47,30,30,4,1
3,605b9fb937964792354af887,-8QoOIfvwwxJ4sY201WP5A,Antoinette,288,2007-08-04 20:21:09,752,220,306,2012201320142015201620172018,"vePby1OhpTiQiX75XrN97A, UG8cewYtZdep2hzSekIqYg, 9BjOR87nmzIsrgez1P_TWQ, mmuRuaxqMYHksF7SXX0yUw, A9tcjDwZoR3BXZXX3WihTg, X3S5AhlmLhyiTlaiTF7NyA, Jk-2JlKDKaZK2e5HV6Wx8g, Bi-2ZIxHgWbttZxacF4DqA, tBOjKo204HYl3QAoUAdfsw, hj3sjhcb0cQlYwh4DBH-JA, LboXWlcBzR-jmrnH-FpNbw, 1P9BpFZ_d3PGCdytDTYJCw, 1uWTjDmXfbNrkbWQgfpURg, xOpVIotAt-JIpRH5TgP1QA, NPAqLyw-hdvO7xK7FCrKQw, 5uM2lMufqPocRVEWIoraQQ, 2d7ophBZbHcHqFcWyNQMPQ, 06eupKERHVeBbHhooCiWLg, dKzwOw29NX0Yt9ybYDF-SQ, rOC6g1YFj_khU-0CPaSLQw, c2ikre7HPiYw5HKh...",25,3.88,8,4,6,2,0,12,32,24,24,11,2
4,605b9fb937964792354af888,EtofuImujQBSo02xa6ZRtQ,Hollyanna,44,2008-11-03 17:31:30,159,176,138,20092010,"8z3hZO0SE75e6kO2Bsm8pA, bRHT_AP9l_HNgGsX4PnDiw, aIs1NUKkxzvHi4HcAiaXMg, _n40AvOlP7bZi3qcljaMyw, 3xZaC_Bv-SR_xGdEDE4BAg, lTvUDMzakGAVnBFSrAHB0g, ztvRQSJ2Be-7TtAYHuMu_w, BQcITnzA0Vi8xktm2R7CyA, sHDingS3sadBzIHY1JlZFg, 6R0mfcDiZCFoQQSgnXJ27g, HFECrzYDpgbS5EmTBtj2zQ, qUYxPCyf-Ofu0CybktNopw, GcRKbN1nKRe__o_ms_LDxA, KfPGrMyGMsjLtrGcV5WnNw, Ci2FtyUgZnnNi0y6-tRYEA, -pm-6Ma8PxYYE1ef0B7fKA, BUB_t_Rvzs1yPEzZipkWjw, Pf7FI0OukC_CEcCz0ZxoUw, SUAXjaF_NAlOfZfiRShF5Q, Sm908_-zs9Qs7GnecLJwCA, OGjsErSxIfcvpamj...",5,3.83,16,6,1,0,1,1,8,14,14,5,1


In [17]:
df_checkin = pdm.read_mongo("checkin", [], "mongodb://localhost:27017/yelp")

In [18]:
df_checkin.head()

Unnamed: 0,_id,business_id,date
0,605ba037b25d6670e8e6410f,--0r8K_AQ4FZfLsX3ZYRDA,2017-09-03 17:13:59
1,605ba037b25d6670e8e64110,--0zrn43LEaB4jUWTQH_Bg,"2010-10-08 22:21:20, 2010-11-01 21:29:14, 2010-12-23 22:55:45, 2011-04-08 17:14:59, 2011-04-11 21:28:45, 2011-04-26 16:42:25, 2011-05-20 19:30:57, 2011-05-24 20:02:21, 2011-08-29 19:01:31"
2,605ba037b25d6670e8e64111,--2mEJ63SC_8_08_jGgVIg,"2010-12-15 17:10:46, 2013-12-28 00:27:54, 2015-10-18 00:43:55, 2016-06-11 19:56:11"
3,605ba037b25d6670e8e64112,--2aF9NhXnNVpDV0KS3xBQ,"2014-11-03 16:35:35, 2015-01-30 18:16:03, 2015-03-16 18:45:30, 2016-08-25 15:42:01, 2017-03-27 20:32:57, 2018-02-12 23:13:56, 2019-04-22 19:34:48, 2020-12-29 16:22:00"
4,605ba037b25d6670e8e64113,--Q3mAcX9t63f7Xcbn7LVA,"2020-07-15 22:29:52, 2020-07-16 22:42:20, 2020-07-18 00:40:39, 2020-07-18 00:51:53, 2020-08-14 21:36:42, 2020-08-15 15:32:49, 2020-08-15 19:24:23, 2020-08-23 22:42:32, 2020-09-10 22:32:41, 2020-09-10 23:26:08, 2020-09-13 17:51:33, 2020-09-20 01:51:01, 2020-09-27 18:04:57, 2020-09-27 22:29:20, 2020-09-28 18:19:21, 2020-09-28 18:19:43, 2020-10-02 20:46:57, 2020-10-03 18:26:32, 2020-10-09 22:25:29, 2020-10-18 18:34:46, 2020-11-01 20:52:39, 2020-11-27 22:19:56, 2020-12-02 22:27:16, 2020-12-02 22..."
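
In `df_checkin`, each `date` value is a single comma-separated string of timestamps rather than one row per check-in. The sketch below, on two made-up rows in the same format, splits the strings and counts weekday and weekend check-ins per business. The output names `weekday_checkins` and `weekend_checkins` match the columns filled in the data-cleaning step of this notebook, but whether the original exercise derived them this way is an assumption.

```python
import pandas as pd

# Two made-up rows in the same shape as df_checkin.
checkin = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "date": [
        "2017-09-03 17:13:59",
        "2010-10-08 22:21:20, 2010-11-01 21:29:14, 2010-12-23 22:55:45",
    ],
})

# Split each string into a list of timestamps, then give every timestamp its own row.
exploded = checkin.assign(date=checkin["date"].str.split(", ")).explode("date")
exploded["date"] = pd.to_datetime(exploded["date"])

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
exploded["weekend_checkins"] = (exploded["date"].dt.dayofweek >= 5).astype(int)
exploded["weekday_checkins"] = 1 - exploded["weekend_checkins"]

# One row per business, ready for a left merge on business_id.
checkin_counts = exploded.groupby("business_id", as_index=False)[
    ["weekday_checkins", "weekend_checkins"]
].sum()
print(checkin_counts)
```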


In [19]:
df_tip = pdm.read_mongo("tip", [], "mongodb://localhost:27017/yelp")

In [20]:
df_tip.head()

Unnamed: 0,_id,user_id,business_id,text,date,compliment_count
0,605b9f917559a109611bfcaf,wDWoMG5N9oI4DJ-p7z8EBg,XWFjKtRGZ9khRGtGg2ZvaA,"1/2-price bowling & the ""Very"" Old Fashion are excellent, but the drink didn't help my bowling score!",2017-07-11 23:07:16,0
1,605b9f917559a109611bfcb0,fTsVDajAyDJ-YzsSdfXSDw,oQyf1788YWsiDLupGva6sw,Cold cuts are the best,2015-06-09 14:35:57,0
2,605b9f917559a109611bfcb1,JmuFlorjjRshHTKzTwNtgg,mkrx0VhSMU3p3uhyJGCoWA,"Solid gold's. Great sauna. Great staff, too. Even at two am!",2016-11-30 08:46:36,0
3,605b9f917559a109611bfcb2,5u7E3LYp_3eB8dLuUBazXQ,9Bto7mky640ocgezVKSfVg,"Nice people, skilled staff, clean location - but! I don't think I've ever been taken on time. In 2 years.",2013-12-13 23:23:41,0
4,605b9f917559a109611bfcb3,sNVpZLDSlCudlXLsnJpg7A,Wqetc51pFQzz04SXh_AORA,So busy...,2014-06-07 12:09:55,0


In [21]:
df_photos = pdm.read_mongo("photos", [], "mongodb://localhost:27017/yelp")

In [22]:
df_photos.head()

Unnamed: 0,_id,photo_id,business_id,caption,label
0,605ba0c48799c3a0e4eaa733,BFE1AFOs27scnnfeBf99ZA,vdT7zlrLB2DL9pStDUs91A,,drink
1,605ba0c48799c3a0e4eaa734,bnJDeS7YSX7jZM6pTsss2Q,1E2KcGtzZO5v_LgrTiQl9A,,drink
2,605ba0c48799c3a0e4eaa735,rLnw0d-YYZvT9kR4y7h7_Q,aQa7N5ZbPhCoKYGGB-gqfg,,drink
3,605ba0c48799c3a0e4eaa736,YufLd_9qqSjifuBgF6a2xQ,N0Y8MQV8_L_9-nnT3jOy8Q,,drink
4,605ba0c48799c3a0e4eaa737,gPOXcGNQcB2V5pAKCncxOQ,L1nn5Cge3wBUHydmX8XwWA,Weird Logo,drink


## Merge the data

Since we are working with data from multiple files, we need to combine the data into a single DataFrame that will allow us to analyze the different characteristics with respect to our target variable, the Yelp `stars` column.

We can do this by merging the DataFrames together, joining them on the columns they have in common. In our case, the shared identifying column is `business_id`.

Merge each of the other 4 DataFrames into our new DataFrame `df_merged` to combine all the data together. Make sure that `df_merged` is the left DataFrame in each merge and that you do `left join` on each of them since not all DataFrames include all the businesses in the dataset (this way we won't lose any data during the merges). Once merged, print the `df_merged` columns. 

In [26]:
#df_merged = pd.merge(df_merged, df_user, how="left", on="user_id")
# df_merged = pd.merge(df_merged, df_checkin, how="left", on="business_id")
df_merged = pd.merge(df_merged, df_tip, how="left", on="business_id")

MemoryError: Unable to allocate 75.5 GiB for an array with shape (22, 460573128) and data type object
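
The error is consistent with row multiplication: `df_merged` already holds one row per review and `df_tip` holds one row per tip, so a join on `business_id` produces (reviews × tips) rows for every business. A common way to avoid this, sketched below on made-up data, is to aggregate the tips to one row per business before merging. The feature names `number_tips` and `average_tip_length` are borrowed from the data-cleaning step later in this notebook; how the original exercise computed them is an assumption.

```python
import pandas as pd

# Toy stand-ins: one row per review, and one row per tip.
merged = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "review_id": ["r1", "r2", "r3"],
})
tip = pd.DataFrame({
    "business_id": ["b1", "b1", "b1"],
    "text": ["So busy...", "Cold cuts are the best", "ok"],
})

# Naive join: b1 contributes 2 reviews x 3 tips = 6 rows, b2 contributes 1.
naive = pd.merge(merged, tip, how="left", on="business_id")

# Aggregate first, so the right-hand table has exactly one row per business.
tip_features = (
    tip.assign(tip_length=tip["text"].str.len())
    .groupby("business_id", as_index=False)
    .agg(number_tips=("text", "count"), average_tip_length=("tip_length", "mean"))
)

# validate="many_to_one" raises if business_id is not unique on the right side.
safe = pd.merge(
    merged, tip_features, how="left", on="business_id", validate="many_to_one"
)
print(len(naive), len(safe))
```

The row count of `safe` equals that of `merged`, so memory grows only by the two added columns.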

In [24]:
df_merged.head()

Unnamed: 0,_id_x,business_id,name_x,address,city,state,postal_code,latitude,longitude,stars_x,review_count_x,is_open,attributes,categories,hours,_id_y,review_id,user_id,stars_y,useful_x,funny_x,cool_x,text,date,_id,name_y,review_count_y,yelping_since,useful_y,funny_y,cool_y,elite,friends,fans,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f1d6b0c7d7a78599c6,xYcbW9MPyLdy8fwbIoAyAQ,i9F35UmYkvBirC2Yl9fuqg,4.0,1,0,0,"Quaint little store with tons of amazing items! We purchased a cabinet from them and it looks amazing in our house. They are very helpful and always adding new inventory. If you let them know you are looking for a certain item, they will FB message you when they have new stock coming it! I like that customer attention! Makes me a BIG fan of this store. Can't wait to find my next piece.",2016-08-25 16:52:19,605b9fbb37964792354b93de,J,123,2007-07-18 19:26:42,184,33,21,,"Lh2x6jI3e9GDxWhxNZn9qw, XdP0zXWQ2xH2gZDDhQ58TQ, 3FkXXdF9hWpSYoLBSXqcsA, 2lK2GEU-ger8MjyzYu9FWQ, GdCoLxDQi9LleGuX7FaNVQ, oP6mE-duqFo1xfCuGd7Gxg, jet6MfoImuVQXfteje4adQ, 7N9M3Sui2bG_MJtS1LcnDw, hvi4vSAGJo3Qef-6uHeumA, qACtKBgrvN1BydF4I2qgXA, KBR3XKeLZ3ke0-rgHxmPtA, NdnmvguJGzSmT40Vux4PmA, u4Y-CDYvowTnNyBg8tTkrg, 4-N1BBCPMioFg5LWqxlCpA, ntrmUWGA0DcjeqGpYj0BGA, V4cNYMvvYXHKS-7OuKW4Dw, A-z9v3m3oskdDQKc_WcEQQ, puj3Cwp2Fljfl-hXJxGIxA, LZYDgdosuqKGogDlx28DYw, 4NC5pOroOBgglcikDGiRjQ, oFSScR2qo3IwKvxb...",4,3.39,0,0,0,0,0,1,4,4,4,2,0
1,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f2d6b0c7d7a7864c0e,ktrlT7JkK9QwlLdpF-sTVA,WADswAe3hbMWM0uUxb_rrQ,5.0,4,0,1,"Gasp! AMAZING!\nThey took the idea of restoring and modifying vintage furniture and did it flawlessly!\nThere are other places in town that have tried this and dropped the ball, but not here!\n\nThe craziest part is, the prices aren't nearly as high as they should be!!!\nConsider how good this stuff looks, the time that went into doing it, the item itself, you're getting a steal of a deal on every item in here!\n\nNow for the things in here, when I showed up I saw:\n1 couch\n3 chairs \n1 cra...",2014-10-12 16:50:27,605b9fbe37964792354c9612,Sam,74,2013-06-23 15:35:48,83,14,21,,"7Svy8yWa8SB14GJvisc8UA, KasqD8rfmIJRF2L3RUgkIw",0,3.86,1,0,0,0,0,0,0,2,2,1,0
2,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f3d6b0c7d7a787177a,9p-z0pVIkXPnoVncOKTAtg,SQ87CzgSrsY9H5ankTA6UQ,4.0,0,0,0,You know that shop that doesn't have alot but what it does have is truly lovely. Well this is that place. The selection is incredibly sparse but what is there is excellent.,2015-01-21 04:38:44,605b9fbe37964792354c7dce,Audrey,25,2012-07-14 02:45:03,17,2,5,,"LekhR5IKXQFL7icDnljhPQ, yGiKYAMW_V-l7WNd6uqj-w, MmxGP6u_u3mjVHxdxAE-gQ, rSLHx4elur0-ZwyZyV1LmA, 3n7aXDpP-mqmUosMrJraWA, VCYtRcED5pBCZBwhEUI8Nw",0,4.5,0,0,0,0,0,0,0,0,0,0,0
3,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f6d6b0c7d7a7891f36,RjktBj4W_T078NK60kaUVw,JUW2nLyVUxc_lGVFxmj-VQ,5.0,1,0,0,"I brought in a client and found several pieces for her home. Impeccable restoration and priced well, this store is an absolute gem!",2015-10-26 20:20:49,605b9fc137964792354dfd81,Jennifer,5,2015-10-26 20:20:43,4,1,0,,,0,5.0,0,0,0,0,0,0,0,0,0,0,0
4,605ba08cbf4286a93a81ab52,bvN78flM8NLprQ1a1y5dRg,The Reclaimory,4720 Hawthorne Ave,Portland,OR,97214,45.511907,-122.613693,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'RestaurantsPriceRange2': '2', 'ByAppointmentOnly': 'False', 'BikeParking': 'False', 'BusinessParking': '{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}'}","Antiques, Fashion, Used, Vintage & Consignment, Shopping, Furniture Stores, Home & Garden","{'Thursday': '11:0-18:0', 'Friday': '11:0-18:0', 'Saturday': '11:0-18:0', 'Sunday': '11:0-18:0'}",605ba0f7d6b0c7d7a789fc6b,LwPbnhZBTB37j00ivD_WKw,nkNAiBFCe9se6wyRZWkZYA,5.0,0,0,0,just bought a piece of furniture here. I love the selection of mid century they have and the way they restore pieces is truly amazing. so glad I happened upon this place. will be returning shortly.,2015-06-04 02:27:58,605b9fbd37964792354c25dc,Bobi,12,2013-08-17 16:07:52,14,2,6,,"CJ5hgPWu52us08repSrQpg, USmFiVHCDl8B47TAwKuaHg, ef1v3hfWy5qTnrNP99L9Fg, BnK-jZ5MINIz0cynQVc8oA, 4MdCJmueSke4kM8Y0XbOIA, FlpJt0WwVTLOB5ZGZYKmAA, rBv7ygL9cnqWNm1SkHC6YQ, 9ibjE8sevttXC9kJGjyrMg, nsB7eUeoLJIe-41h2JlcVQ, 0Bm5HuJBcegmgMViJ_MUbw, zOYNi5YeQ8fKa12_RHCdiA, TmF-SD5YEEcmGRiOr_ZnVQ, SPUd4tKv4rkd-wsYeZ1Zow, KZkK2ZHpWFqPVo7O09E7BA, g-pglctPh_Scv0zHdCX2Eg, nGcvS6HZPtrGgwprM8bDXA, 6CHmdD8a6rjZmDL9rh5POA, LQqWgY-4j6jypHRjBj9Dfw, OkSTF5tOMN6RHAwFosGA6g, VQpYXNiEljBGbo0Ct7BPdQ, 0guCm4WPDPzTm_Kb...",0,4.58,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Need to manage RAM
del df_user, df_checkin, df_tip
gc.collect()
df_user = pd.DataFrame()
df_checkin = pd.DataFrame()
df_tip = pd.DataFrame()

In [None]:
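import feather
# Checkpoint the merged data to disk in Feather format so it can be reloaded without repeating the merges.
feather.write_dataframe(df_merged, 'yelp_df_merged.file')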

In [None]:
df_merged = pd.merge(df_merged, df_photos, how="left", on="business_id")

In [None]:
print(df_merged.columns)

In [None]:
# The remaining cells refer to the merged data as `df`.
df = df_merged
df.head()

In [None]:
print(len(df))

## Data cleaning
We're getting really close to the fun analysis part! We just need to clean up our data a bit so we can focus on the characteristics that might have predictive power in determining an establishment's Yelp rating.

In a Linear Regression model, our features will ideally be continuous variables that affect our dependent variable, the Yelp rating. For this project we will also work with some binary features, which take the value `0` or `1`. We can therefore eliminate any columns that are neither continuous nor binary, as well as columns we do not want to use for prediction. The next cell contains a list of these unnecessary features. Remove them from `df` with Pandas' `drop` method. The columns to remove are:

```
'address',
'attributes',
'business_id',
'categories',
'city',
'hours',
'is_open',
'latitude',
'longitude',
'name',
'neighborhood',
'postal_code',
'state',
'time'
```

Go ahead and do the above operation in the next cell

In [None]:
features_to_remove = ['address',
'attributes',
'business_id',
'categories',
'city',
'hours',
'is_open',
'latitude',
'longitude',
'name',
'neighborhood',
'postal_code',
'state',
'time']

df.drop(labels=features_to_remove, axis=1, inplace=True)
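
`drop` raises a `KeyError` if any listed label is missing, which can happen here: depending on the data source and the merge suffixes, columns such as `neighborhood`, `time`, or `name` may not exist under exactly those names (the merged output above shows `name_x`, for instance). A defensive variant, sketched on a toy frame, passes `errors="ignore"` so that absent labels are skipped.

```python
import pandas as pd

# Toy frame with one column that is listed for removal and two that are not.
toy = pd.DataFrame({
    "name_x": ["The Reclaimory"],
    "city": ["Portland"],
    "stars": [4.5],
})

# "name" and "neighborhood" are not present in toy; errors="ignore" skips them.
to_remove = ["city", "name", "neighborhood"]
cleaned = toy.drop(columns=to_remove, errors="ignore")
print(cleaned.columns.tolist())
```

The trade-off is that a misspelled column name is also skipped silently, so it is worth checking `cleaned.columns` afterwards.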

In [None]:
df.columns

In [None]:
df.head()

Now we need to check that our data has no missing values, or `NaN`s, which would prevent the Linear Regression model from working correctly. To do this we can use `df.isna().any()`, which returns, for each column, `True` if the column contains missing values and `False` if it does not. Check whether `df` is missing any values.

In [None]:
df.isna().any()

As you can see, there are some columns with missing values. Since our dataset has no information recorded for some businesses in these columns, we will assume that the Yelp pages do not show these features. For example, if there is a `NaN` value for `number_pics`, it means that the associated business did not have any images posted on their Yelp page. This way we can replace all our `NaNs` with `0`s. To do this we can use the `.fillna()` method.

Fill the missing values in `df` with `0`. Then, confirm that the missing values have been filled with `df.isna().any()`.

In [None]:
df.fillna({ 'weekday_checkins' : 0,
          'weekend_checkins': 0,
          'average_tip_length': 0,
          'number_tips': 0,
          'average_caption_length': 0,
          'number_pics': 0}, inplace= True)

In [None]:
df.isna().any()
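To see the check-fill-check cycle in isolation, here is a small sketch on a toy DataFrame. The column names mirror two of the real ones, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made up for illustration).
toy = pd.DataFrame({
    "stars": [4.0, 3.5, 5.0],
    "number_pics": [3.0, np.nan, 1.0],
    "number_tips": [np.nan, 2.0, np.nan],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Fill only the listed columns with 0; other columns are left alone.
filled = toy.fillna({"number_pics": 0, "number_tips": 0})

# Count again to confirm nothing is missing.
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```

Using `.isna().sum()` instead of `.isna().any()` shows how many values are missing in each column, which can help you judge whether filling with `0` is a reasonable assumption.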

In [None]:
df.head()

## Exploratory analysis

Now that our data is all together, let's investigate some of the different features to see what might correlate the most with our dependent variable, Yelp rating (called `stars` in our DataFrame). The features with the best correlations might be the most useful for our Linear Regression model! 

Pandas DataFrames have a really useful method, `.corr()`, that allows us to see the correlation coefficients for each pair of our different features. Remember, a correlation of `0` indicates that two features have no linear relationship, a correlation coefficient of `1` indicates that two features have a perfect positive linear relationship, and a correlation coefficient of `-1` indicates that two features have a perfect negative linear relationship. 

Call `.corr()` on `df`. You will see that `number_funny_votes` has a correlation coefficient of `0.001320` with respect to `stars`, the Yelp rating. This is a very weak correlation. Which features correlate most strongly, both positively and negatively, with Yelp rating?

In [None]:
df.corr()
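The full correlation matrix can be hard to scan. One possible way to answer the question above is to keep only the `stars` column and sort it by absolute value. The sketch below uses a toy DataFrame with invented columns (`sentiment`, `wait_minutes`, `funny_votes`) to show the idea:

```python
import pandas as pd

# Toy data (made up): one feature rises with stars, one falls, one is unrelated.
toy = pd.DataFrame({
    "stars": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sentiment": [-0.8, -0.3, 0.1, 0.5, 0.9],
    "wait_minutes": [50, 40, 30, 20, 10],
    "funny_votes": [2, 0, 3, 0, 2],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_stars = toy.corr()["stars"].drop("stars")

# Sort by strength of the relationship, ignoring the sign.
ranked = corr_with_stars.sort_values(key=abs, ascending=False)

print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top, next to the strong positive ones.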

To better visualize these relationships, we can plot certain features against our dependent variable, the Yelp rating, by importing Matplotlib. We can use Matplotlib's `.scatter()` function to plot these relationships. Pass `alpha=0.1` as a third argument so that overlapping points remain distinguishable.

Plot the three features that correlate most strongly with Yelp stars (`average_review_sentiment`, `average_review_length`, `average_review_age`) against `stars`. Then plot a low-correlation feature, such as `number_funny_votes`, against `stars`.

Note: `average_review_sentiment` is the average sentiment score of all reviews on a business's Yelp page. The sentiment score of each review was calculated with the VADER sentiment analysis tool. VADER uses a labeled set of positive and negative words, along with coded grammar rules, to estimate how positive or negative a review is. Scores range from `-1` to `+1`, with a score of `0` indicating a neutral comment. While not perfect, VADER does a good job of guessing sentiment from text data! This is why you will see a high correlation with `stars`.

In [None]:
!pip install matplotlib

In [None]:
from matplotlib import pyplot as plt

plt.scatter(df['average_review_sentiment'], df['stars'], alpha=0.1)
plt.xlabel('average_review_sentiment')
plt.ylabel('Yelp Rating')
plt.show()

In [None]:
plt.scatter(df['average_review_age'], df['stars'], alpha=0.1)
plt.xlabel('average_review_age')
plt.ylabel('Yelp Rating')
plt.show()

In [None]:
plt.scatter(df['number_funny_votes'], df['stars'], alpha=0.1)
plt.xlabel('number_funny_votes')
plt.ylabel('Yelp Rating')
plt.show()

## Data selection
To put our data into a Linear Regression model, we need to separate the features (columns) we will model on from the value we want to predict, Yelp stars.

From our correlation analysis we saw that the three features with the strongest correlations for Yelp stars are `average_review_sentiment`, `average_review_length`, and `average_review_age`. 

Since we want to go a little deeper than just using `average_review_sentiment`, which understandably has a very high correlation with Yelp stars, let's choose to create our first model with `average_review_length` and `average_review_age` as features.

Create a new DataFrame called `features` containing the columns we want to model on: `average_review_length` and `average_review_age`. Then create a Series called `ratings` that stores the value we want to predict, the `stars` column of `df`.

In [None]:
features = df[['average_review_length', 'average_review_age']]
ratings = df['stars']

## Split data into training and test sets.
We are almost ready to model! But first, we need to split our data into a training set and a test set so we can evaluate how well our model works. 

We will use scikit-learn's `train_test_split` function to do this split. This function takes two required parameters: the data, or our features, followed by our dependent variable, in our case Yelp stars. Set the optional parameter `test_size` to `0.2`. Finally, set the optional parameter `random_state` to `1`. This will cause your data to be split in the same way as the data in our solution code. Assign the results to the following variables: `X_train, X_test, y_train, y_test`.

In [None]:
!pip3 install scikit-learn

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size=0.2, random_state=1)
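To make the effect of `test_size` and `random_state` concrete, here is a small sketch on ten made-up rows. The `toy_features` and `toy_ratings` names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features and a target.
toy_features = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
toy_ratings = pd.Series(np.linspace(1, 5, 10), name="stars")

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_ratings, test_size=0.2, random_state=1
)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_features, toy_ratings, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)
print(list(X_te.index) == list(X_te2.index))
```

Features and targets are shuffled together, so each row of `X_tr` still lines up with the matching entry of `y_tr`.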

## Create and train the model
Now that our data is divided into training and test sets, we can finally model! Import `LinearRegression` from the `linear_model` module of scikit-learn. 

Create a new `LinearRegression` object named `model`. The `.fit()` method will fit our Linear Regression model to our training data and calculate the coefficients for our features. Call the `.fit()` method on the model with `X_train` and `y_train` as parameters. This way, our model will be trained on our training data!

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

## Evaluating and understanding the model
Now we can evaluate our model in several ways. The first way will be to use the `.score()` method, which provides the `R^2` value for our model. Remember, `R^2` is the coefficient of determination, a measure of how much of the variance in our dependent variable, Yelp stars, is explained by our independent variables, our feature data.

For a linear regression scored on its own training data, `R^2` ranges from `0` to `1`, with `0` indicating that the model explains none of the variance in Yelp stars and `1` indicating that it fits the data perfectly. On held-out test data, `R^2` can even be negative if the model predicts worse than simply guessing the mean rating.

Call `.score()` on our model with `X_train` and `y_train` as parameters to calculate our `R^2` score. Then call `.score()` again on the model with `X_test` and `y_test` as parameters to calculate `R^2` for our test data.

What do these `R^2` values say about our model? Do you think these features alone are capable of effectively predicting Yelp stars?

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)
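If it helps to see what `.score()` computes, the sketch below reproduces `R^2` by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses synthetic data generated on the spot (an assumption for illustration, not the Yelp data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=200)

toy_model = LinearRegression().fit(X, y)
y_hat = toy_model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Predictions that are worse than the mean give a negative R^2.
bad_pred = np.full_like(y, 10.0)
r2_bad = 1 - np.sum((y - bad_pred) ** 2) / ss_tot

print(r2_manual, toy_model.score(X, y))
print(r2_bad)
```

The last two lines of the calculation show why `R^2` is not strictly bounded below by `0`: any set of predictions that misses by more than the mean would yields a negative value.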

After so much work, we can finally take a look at the coefficients of our different features! 

The model has a `.coef_` attribute, which is an array of the feature coefficients determined by fitting our model to the training data. To make it easier to see which feature corresponds to which coefficient, we have provided code that pairs each feature name with its coefficient and sorts the pairs in descending order of absolute coefficient value. Keep in mind that the size of a coefficient depends on the units of its feature, so this ordering is only a rough guide to which features are most predictive.

```
sorted(list(zip(['average_review_length','average_review_age'],model.coef_)),key = lambda x: abs(x[1]),reverse=True)
```

Copy and paste it into the following cell.

In [None]:
sorted(list(zip(['average_review_length','average_review_age'],model.coef_)),key = lambda x: abs(x[1]),reverse=True)
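Raw coefficients are expressed per unit of each feature, so a feature measured in hundreds (such as a review length) will tend to have a smaller coefficient than one measured in single digits, regardless of its influence. One common way to compare features on an equal footing is to standardize them first. The sketch below is an illustration on synthetic data, with made-up column names and effect sizes, and is not part of the original exercise:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (values are made up).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "review_length": rng.normal(600, 200, n),  # hundreds
    "review_age": rng.normal(3, 1, n),         # single digits
})
toy_stars = (3.5 - 0.001 * toy["review_length"]
             + 0.05 * toy["review_age"]
             + rng.normal(0, 0.1, n))

# Coefficients on the raw features: one per original unit.
raw_model = LinearRegression().fit(toy, toy_stars)
raw_coefs = dict(zip(toy.columns, raw_model.coef_))

# Coefficients after standardizing: one per standard deviation.
scaled = StandardScaler().fit_transform(toy)
std_model = LinearRegression().fit(scaled, toy_stars)
std_coefs = dict(zip(toy.columns, std_model.coef_))

print(raw_coefs)
print(std_coefs)
```

In this toy setup the ranking by absolute value flips between the two fits, which is why the sorted list above should be read with the feature units in mind.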

Finally, we can calculate the predicted Yelp stars for our test data and compare them to the actual Yelp stars. 

Our model has a `.predict()` method that uses the model coefficients to calculate the predicted Yelp value. Call `.predict()` on `X_test` and assign the values to `y_predicted`. 

Use Matplotlib to plot `y_test` vs `y_predicted`. For a perfect linear regression model, we would expect to see the data plotted along the `y = x` line. Is this the case? If not, why not?

In [None]:
y_predicted = model.predict(X_test)
print(y_predicted)

In [None]:
plt.scatter(y_test, y_predicted)
plt.xlabel('Yelp Rating')
plt.ylabel('Predicted Yelp Rating')
plt.show()

In [None]:
print(y_test)

In [None]:
print(y_predicted)
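To see how `.predict()` uses the coefficients, the following sketch recomputes the predictions by hand as the dot product of the features with `coef_` plus `intercept_`. The four data rows are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Four made-up rows: [average review length, average review age].
X = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 4.0], [400.0, 3.0]])
y = np.array([4.5, 4.0, 3.0, 3.5])

toy_model = LinearRegression().fit(X, y)

# Linear regression prediction: features times coefficients, plus intercept.
manual = X @ toy_model.coef_ + toy_model.intercept_

print(np.allclose(manual, toy_model.predict(X)))
```

Because the prediction is a weighted sum, nothing forces it to land on a valid star value, which is worth keeping in mind when reading the scatter plot above.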

## Define different subsets of data.
After evaluating the first model, you can see that `average_review_length` and `average_review_age` alone are not the best predictors for Yelp stars. 

Let's do some more modeling with different subsets of features and see if we can achieve a more accurate model! 

In the cells below we have provided different lists of feature subsets that we will model and evaluate with. What other feature subsets would you like to test? Why do you think those feature sets are more predictive of Yelp stars than others? Create at least one additional subset of features from which you would like to predict Yelp stars. Copy and paste the subsets into the following cell.

```
# subset of only average review sentiment
sentiment = ['average_review_sentiment']

# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']

# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length','average_review_sentiment','number_funny_votes','number_cool_votes','number_useful_votes','average_tip_length','number_tips','average_number_friends','average_days_on_yelp','average_number_fans','average_review_count','average_number_years_elite','weekday_checkins','weekend_checkins']

# all features
all_features = binary_features + numeric_features
```

In [None]:
# subset of only average review sentiment
sentiment = ['average_review_sentiment']

# subset of all features that have a response range [0,1]
binary_features = ['alcohol?','has_bike_parking','takes_credit_cards','good_for_kids','take_reservations','has_wifi']

# subset of all features that vary on a greater range than [0,1]
numeric_features = ['review_count','price_range','average_caption_length','number_pics','average_review_age','average_review_length']

# all features
all_features = binary_features + numeric_features
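Before modeling, it can be worth checking that every name in a subset actually exists in the DataFrame, since column names may differ between data sources. A minimal sketch, using an empty toy DataFrame and a made-up `wanted` list:

```python
import pandas as pd

# Toy DataFrame with only a few of the expected columns (illustrative).
toy = pd.DataFrame(columns=["stars", "alcohol?", "has_wifi", "review_count"])

# Feature names we would like to model on.
wanted = ["alcohol?", "has_wifi", "price_range", "review_count"]

# Split the wanted names into those that exist and those that do not.
present = [c for c in wanted if c in toy.columns]
missing = [c for c in wanted if c not in toy.columns]

print("present:", present)
print("missing:", missing)
```

Running the same two list comprehensions against `df.columns` and `all_features` would show which features, if any, need to be renamed or removed before fitting.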

In [None]:
df.head()

## Other models
Now that we have lists of different subsets of features, we can create new models from them. In order to more easily compare the performance of these new models, we have created a function for you called `model_these_features()`. 

This function replicates the model building process you just completed with our first model! Take some time to review how it works, analyzing it line by line. Fill in the empty comments with an explanation of the task the code below is performing.

Import NumPy, then copy and paste the function into the following cell.

```
import numpy as np

# take a list of features to model as a parameter
def model_these_features(feature_list):

    # define ratings and features, with the features limited to our chosen subset of data
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]

    # perform train, test, split on the data
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)

    # don't worry too much about these lines, just know that they allow the model to work when
    # we model on just one feature instead of multiple features. Trust us on this one :)
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)

    # create and fit the model to the training data
    model = LinearRegression()
    model.fit(X_train,y_train)

    # print the train and test scores
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))

    # print the model features and their corresponding coefficients, from most predictive to least predictive
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))

    # calculate the predicted Yelp ratings from the test data
    y_predicted = model.predict(X_test)

    # plot the actual Yelp Ratings vs the predicted Yelp ratings for the test data
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()
```

In [None]:
!pip install numpy

In [None]:
import numpy as np

# take a list of features to model as a parameter
def model_these_features(feature_list):

    # define ratings and features, with the features limited to our chosen subset of data
    ratings = df.loc[:,'stars']
    features = df.loc[:,feature_list]

    # perform train, test, split on the data
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)

    # don't worry too much about these lines, just know that they allow the model to work when
    # we model on just one feature instead of multiple features. Trust us on this one :)
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1,1)
        X_test = np.array(X_test).reshape(-1,1)

    # create and fit the model to the training data
    model = LinearRegression()
    model.fit(X_train,y_train)

    # print the train and test scores
    print('Train Score:', model.score(X_train,y_train))
    print('Test Score:', model.score(X_test,y_test))

    # print the model features and their corresponding coefficients, from most predictive to least predictive
    print(sorted(list(zip(feature_list,model.coef_)),key = lambda x: abs(x[1]),reverse=True))

    # calculate the predicted Yelp ratings from the test data
    y_predicted = model.predict(X_test)

    # plot the actual Yelp Ratings vs the predicted Yelp ratings for the test data
    plt.scatter(y_test,y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1,5)
    plt.show()

Once you are comfortable with the function steps, run models on the following subsets of data using `model_these_features()`:

`sentiment`: only `average_review_sentiment`.

`binary_features`: all features that have a response range `[0,1]`.

`numeric_features`: all features that vary in a range greater than `[0,1]`.

`all_features`: all features.

`feature_subset`: your own subset of features.

How does changing feature sets affect the `R^2` value of the model? Which features are most important for predicting Yelp stars in different models?

In [None]:
model_these_features(sentiment)

In [None]:
model_these_features(binary_features)

In [None]:
model_these_features(all_features)
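To compare feature sets side by side, one option is to collect the train and test `R^2` values in a single table instead of reading them from separate printouts. The sketch below uses a hypothetical helper, `score_feature_subsets`, and synthetic data with invented `signal` and `noise` columns. It is not part of the provided solution code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def score_feature_subsets(data, target, subsets, test_size=0.2, random_state=1):
    """Hypothetical helper: fit one model per subset and tabulate R^2 scores."""
    rows = []
    for name, cols in subsets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            data[cols], data[target], test_size=test_size, random_state=random_state
        )
        fitted = LinearRegression().fit(X_tr, y_tr)
        rows.append({
            "subset": name,
            "train_r2": fitted.score(X_tr, y_tr),
            "test_r2": fitted.score(X_te, y_te),
        })
    return pd.DataFrame(rows).set_index("subset")

# Synthetic data: 'signal' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
toy["stars"] = 3.5 + 0.8 * toy["signal"] + rng.normal(scale=0.3, size=n)

results = score_feature_subsets(
    toy,
    "stars",
    {"noise_only": ["noise"], "signal_only": ["signal"], "both": ["signal", "noise"]},
)
print(results)
```

With the real data, the same call could take `df`, `'stars'`, and a dictionary such as `{'sentiment': sentiment, 'binary': binary_features, 'all': all_features}`.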

## Debut of your new restaurant - Pizzas and Pizzas
You've loaded the data, cleaned it up, modeled it and evaluated it. You're tired, but beaming with pride after all the hard work. You close your eyes and you can clearly see the opening day of Pizzas and Pizzas with a line of people out the door. But how many will be your Yelp stars? Let's use our model to make a prediction.

Our best model was the one that used all the features, so we will work with this model again. In the cell below, print `all_features` to get a reminder of the features we are working with.

In [None]:
print(all_features)

Run the cell below to grab all the features and retrain our model on them.

```
features = df.loc[:,all_features]
ratings = df.loc[:,'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train,y_train)
```

In [None]:
features = df.loc[:,all_features]
ratings = df.loc[:,'stars']
X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size = 0.2, random_state = 1)
model = LinearRegression()
model.fit(X_train,y_train)

To give you a perspective on the restaurants that already exist, we have provided the average, minimum and maximum values for each feature/column below. Will Pizzas and Pizzas be another average restaurant, or will it be a 5-star giant among the masses?

```
pd.DataFrame(list(zip(features.columns,features.describe().loc['mean'],features.describe().loc['min'],features.describe().loc['max'])),columns=['Feature','Mean','Min','Max'])
```


In [None]:
pd.DataFrame(list(zip(features.columns,features.describe().loc['mean'],features.describe().loc['min'],features.describe().loc['max'])),columns=['Feature','Mean','Min','Max'])
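The same summary can be produced more compactly with `.agg()`, which computes the statistics in one pass instead of calling `.describe()` three times. This is an alternative sketch shown on a toy DataFrame with invented values:

```python
import pandas as pd

# Toy features (made-up values).
toy_features = pd.DataFrame({
    "has_wifi": [0, 1, 1, 0],
    "review_count": [10, 200, 35, 5],
})

# One row per feature, one column per statistic.
summary = (
    toy_features.agg(["mean", "min", "max"])
    .T.rename(columns={"mean": "Mean", "min": "Min", "max": "Max"})
)

print(summary)
```

Here the feature names become the index of the table rather than a separate `Feature` column.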

Based on your plans for the restaurant, what values do you expect each of these features to take on your Yelp page? Fill in the blanks in the NumPy array below with your desired values.

The first value corresponds to the feature at `index=0` in the DataFrame above, `alcohol?`, and the last value corresponds to the final feature in `all_features`. With the full feature list shown earlier, that is `weekend_checkins` at `index=24`. With the shortened `numeric_features` used in this notebook's cell, it is `average_review_length` at `index=11`, which is why the array below has 12 values. Be sure to enter `0` or `1` for all binary features, and if you are not sure what value to use for a feature, use its mean from the DataFrame above.

Save the numpy array in a variable named `pizzas_pizzas` and remember to `reshape(1, -1)` it.

After entering the values, run the prediction cell below to receive your Yelp rating! What will the debut of Pizzas and Pizzas look like?

In [None]:
print(all_features)

In [None]:
pizzas_pizzas = np.array([1, 1, 1, 1, 1, 1, 10, 2, 0, 100, 4, 350]).reshape(1, -1)

In [None]:
model.predict(pizzas_pizzas)
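A positional array is easy to get wrong, because every value must be in exactly the same order as the training columns. One alternative is to describe the restaurant as a dictionary and build a one-row DataFrame with the columns reordered to match. The sketch below is an illustration with a toy model, three made-up features, and invented coefficients, not the model trained above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data with three illustrative features (values are made up).
feature_names = ["has_wifi", "price_range", "average_review_length"]
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "has_wifi": rng.integers(0, 2, 100),
    "price_range": rng.integers(1, 5, 100),
    "average_review_length": rng.normal(600, 150, 100),
})
toy_stars = (3.0 + 0.3 * train["has_wifi"] + 0.1 * train["price_range"]
             - 0.001 * train["average_review_length"]
             + rng.normal(0, 0.1, 100))

toy_model = LinearRegression().fit(train[feature_names], toy_stars)

# Describe the new restaurant by name, in any order.
plans = {"average_review_length": 350, "has_wifi": 1, "price_range": 2}

# Reorder the columns to match the order used during training.
new_restaurant = pd.DataFrame([plans])[feature_names]
prediction = toy_model.predict(new_restaurant)

# The model's output is unbounded, so clip it to the 1 to 5 star range.
clipped = np.clip(prediction, 1, 5)
print(clipped)
```

With the real model, `feature_names` would be `all_features`, and `plans` would need one entry per feature.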

## Next steps

You have successfully built a linear regression model that predicts a restaurant's Yelp stars! As you've seen, it can be quite difficult to predict stars like this even when we have a plethora of data. 

What other questions come to mind when you see the data we have? What insights do you think you could predict from a different type of analysis? Here are some ideas to ponder:

- Can we predict the type of cuisine in a restaurant based on the users who review it?

- Which restaurants are similar to each other in other ways besides the type of cuisine?

- Is there a different ambiance in restaurants, and which types of restaurants fit these concepts?

- How does social media status affect a restaurant's credibility and visibility?

As you advance in the field of data science, you'll be able to create models that address these questions and many more. But in the meantime, congratulations, you have achieved a great accomplishment!!!