# 1. Twitch API data

The URL `https://wind-bow.glitch.me/twitch-api/channels/{CHANNEL_NAME}` is an API from twitch to get data from twitch channels.

Get the data from the following channels:

```
["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", 
    "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
    "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", 
    "tsm_bjergsen", "boxbox", "wtcn", "a_seagull",
    "kinggothalion", "amazhs", "jahrein", "thenadeshot", 
    "sivhd", "kingrichard"]
```

To make into a dataframe that looks like this:

![](twitch.png)

In [21]:
import requests
import json
import pandas as pd

df = pd.DataFrame()

channels = ["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", 
    "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
    "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", 
    "tsm_bjergsen", "boxbox", "wtcn", "a_seagull",
    "kinggothalion", "amazhs", "jahrein", "thenadeshot", 
    "sivhd", "kingrichard"]

key_maps = {
    '_id': 'id',
    'display_name': 'display_name',
    'status': 'status',
    'followers': 'followers',
    'views': 'views'
}

req_url = 'req_url = https://wind-bow.glitch.me/twitch-api/channels/{CHANNEL_NAME}'

for channel in channels:
    
    try:
        r = requests.get('https://wind-bow.glitch.me/twitch-api/channels/{CHANNEL_NAME}'.format(CHANNEL_NAME=channel))
        if r.status_code == 200:
            j = json.loads(r.content)
            ob = {}
            for key in key_maps.keys():
                if key in j:
                    ob[key] = j[key]
            df = df.append(ob, ignore_index=True)
        else:
            print("Got {code} for channel {CHANNEL}".format(
                code=r.status_code,
                CHANNEL=channel
            ))
    except requests.exceptions.HTTPError as err:
        print(err)
    except requests.exceptions.RequestException as e:  # This is the correct syntax
        print(e)
df

Got 404 for channel ninja
Got 404 for channel shroud
Got 404 for channel Dakotaz
Got 404 for channel esltv_cs
Got 404 for channel pokimane
Got 404 for channel tsm_bjergsen
Got 404 for channel boxbox
Got 404 for channel wtcn
Got 404 for channel a_seagull
Got 404 for channel kinggothalion
Got 404 for channel amazhs
Got 404 for channel jahrein
Got 404 for channel thenadeshot
Got 404 for channel sivhd
Got 404 for channel kingrichard


Unnamed: 0,_id,display_name,followers,status,views
0,30220059.0,ESL_SC2,135394.0,RERUN: StarCraft 2 - Terminator vs. Parting (P...,60991791.0
1,71852806.0,OgamingSC2,40895.0,UnderDogs - Rediffusion - Qualifier.,20694507.0
2,90401618.0,cretetion,908.0,It's a Divison kind of Day,11631.0
3,79776140.0,FreeCodeCamp,10122.0,Greg working on Electron-Vue boilerplate w/ Ak...,163747.0
4,86238744.0,storbeck,10.0,,1019.0
5,6726509.0,Habathcx,14.0,Massively Effective,764.0
6,54925078.0,RobotCaleb,20.0,Code wrangling,4602.0
7,82534701.0,noobs2ninjas,835.0,Building a new hackintosh for #programming and...,48102.0


# 2. App Store Reviews

The Apple app store has a `GET` API to get reviews on apps. The URL is:

```
https://itunes.apple.com/{COUNTRY_CODE}/rss/customerreviews/id={APP_ID_HERE}/page={PAGE_NUMBER}/sortby=mostrecent/json
```

Note that you need to provide:

- The country code (eg. `'us'`, `'gb'`, `'ca'`, `'au'`) 

- The app ID. This can be found in the web page for the app right after `id`. For instance, Candy Crush's US webpage is:

`https://apps.apple.com/us/app/candy-crush-saga/id553834731`

So here the ID would be `553834731`.

- The "Page Number". The request responds with multiple pages of data, but sends them one at a time. So you can cycle through the data pages for any app on any country.

### 2.1 English app reviews

Get all english reviews you can for Candy Crush, Tinder, the Facebook app and Twitter (you have to get them from all the english-speaking countries you can think of!).

### 2.2 Best version

For each app, get the version that is the best rated.

Make a visualization of the ratings per versions per app to show this.

### 2.3 Top words

Which word for each app is most common in the 5 star and in the 1-star review's titles?

Note: `df.title.str.get_dummies()` is your friend

Note: This might create a lot of data! Try to break down your analysis in chunks if it doesn't work.

**2.1 English Apps**

In [52]:
from urllib.request import urlopen
import time

app_ids={
    'tinder': '547702041',
    'facebook': '284882215',
    'twitter': '333903271',
    'candy-crush': '553834731'
}
country_codes=[
    'ca',
    'us',
    'aus',
    'nz',
    'ie',
    'gb'
]


def is_error_response(http_response, seconds_to_sleep: float = 1) -> bool:
    """
    Returns False if status_code is 503 (system unavailable) or 200 (success),
    otherwise it will return True (failed). This function should be used
    after calling the commands requests.post() and requests.get().

    :param http_response:
        The response object returned from requests.post or requests.get.
    :param seconds_to_sleep:
        The sleep time used if the status_code is 503. This is used to not
        overwhelm the service since it is unavailable.
    """
    if http_response.status_code == 503:
        time.sleep(seconds_to_sleep)
        return False

    return http_response.status_code != 200

def get_json(url):# -> typing.Union[dict, None]:
    """
    Returns json response if any. Returns None if no json found.

    :param url:
        The url go get the json from.
    """
    response = requests.get(url)
    if is_error_response(response):
        return None
    json_response = response.json()
    return json_response

In [99]:
itunes_url = 'https://itunes.apple.com/{COUNTRY_CODE}/rss/customerreviews/id={APP_ID_HERE}/page={PAGE_NUMBER}/sortby=mostrecent/json'

apps_r = []

def get_reviews_for(app_name, in_country, at_page=1):
    
    global app_ids
    app_id = app_ids[app_name]
    reviews = []
    
    while True:
        url = (f'https://itunes.apple.com/{in_country}/rss/customerreviews/page={at_page}/id={app_id}/sortby=mostrecent/json')
        json = get_json(url)

        if not json:
            return reviews

        feed = json.get('feed')
        
        try:
            if not feed.get('entry'):
                get_reviews_for(app_id, in_country, at_page + 1)
            reviews += [
                {
                    'review_id': entry.get('id').get('label'),
                    'app': app_name,
                    'title': entry.get('title').get('label'),
                    'author': entry.get('author').get('name').get('label'),
                    'author_url': entry.get('author').get('uri').get('label'),
                    'version': entry.get('im:version').get('label'),
                    'rating': entry.get('im:rating').get('label'),
                    'review': entry.get('content').get('label'),
                    'vote_count': entry.get('im:voteCount').get('label'),
                    'page': at_page
                }
                for entry in feed.get('entry')
                if not entry.get('im:name')
            ]
            at_page += 1
        except Exception as e:
            return reviews

for country in country_codes:
    for app in app_ids.keys():
        print("Fetching", app, "for", country)
        apps_r += get_reviews_for(app, country)
print("Done")

Fetching tinder for ca
Fetching facebook for ca
Fetching twitter for ca
Fetching candy-crush for ca
Fetching tinder for us
Fetching facebook for us
Fetching twitter for us
Fetching candy-crush for us
Fetching tinder for aus
Fetching facebook for aus
Fetching twitter for aus
Fetching candy-crush for aus
Fetching tinder for nz
Fetching facebook for nz
Fetching twitter for nz
Fetching candy-crush for nz
Fetching tinder for ie
Fetching facebook for ie
Fetching twitter for ie
Fetching candy-crush for ie
Fetching tinder for gb
Fetching facebook for gb
Fetching twitter for gb
Fetching candy-crush for gb
Done


In [101]:
apps = pd.DataFrame()
for app in apps_r:
    apps = apps.append(app, ignore_index=True)
print('Done')
apps['rating'] = apps['rating'].astype(float)
apps

Done


Unnamed: 0,app,author,author_url,page,rating,review,review_id,title,version,vote_count
0,tinder,cayercanada,https://itunes.apple.com/ca/reviews/id682447433,1.0,1.0,Tinder’s algorithm sucks more than anyone on t...,7062159971,Worst algorithm,12.3.0,0
1,tinder,bbsidnevg,https://itunes.apple.com/ca/reviews/id230770468,1.0,1.0,"App is getting worse over time, withholding li...",7062084239,Getting worse,12.3.0,0
2,tinder,skekfbrbbe,https://itunes.apple.com/ca/reviews/id468342730,1.0,2.0,Je suis extremely déçu je suis payer 20$ pars ...,7061761432,Trop cher,12.3.0,0
3,tinder,Breehina,https://itunes.apple.com/ca/reviews/id166568072,1.0,1.0,super laggy making it not worth the time to use,7061647624,Bad,12.3.0,0
4,tinder,The tailt,https://itunes.apple.com/ca/reviews/id120446882,1.0,1.0,This thing's a scam or I'm the ugliest sob out...,7060895420,Scam,12.2.0,0
...,...,...,...,...,...,...,...,...,...,...
8495,candy-crush,lowintheshow,https://itunes.apple.com/gb/reviews/id1173609209,10.0,5.0,Candy crush is a fun game where you can listen...,6004441682,Candy crush,1.177.1.3,0
8496,candy-crush,superman1347722,https://itunes.apple.com/gb/reviews/id982175838,10.0,1.0,This game is too addictive and the lives can s...,6004395551,ADDICTIVE,1.177.1.3,0
8497,candy-crush,fix this bad app,https://itunes.apple.com/gb/reviews/id1173286148,10.0,3.0,The latest update is really bad you can only u...,6000475696,Lates update change the ability to use the scr...,1.176.0.2,0
8498,candy-crush,zxcbnikl,https://itunes.apple.com/gb/reviews/id133247271,10.0,1.0,This app is specifically designed to defraud p...,5999995913,Absolute con,1.176.0.2,0


### 2.2 Best version

For each app, get the version that is the best rated.

Make a visualization of the ratings per versions per app to show this.

In [112]:
gp = apps.groupby(['app','version'], as_index=False)['rating'].mean()#pd.DataFrame({'average_rating' : }).reset_index()
gp.dropna()
gp = gp[gp['version'] != '']
gp

Unnamed: 0,app,version,rating
0,candy-crush,1.100.0,4.550000
1,candy-crush,1.101.0,4.750000
2,candy-crush,1.102.1,4.642857
3,candy-crush,1.103.0,4.903226
4,candy-crush,1.104.0,4.689655
...,...,...,...
571,twitter,8.55.1,2.800000
572,twitter,8.6,4.000000
573,twitter,8.7.1,4.333333
574,twitter,8.8,3.666667


In [116]:
print("The top rated app versions for each app is:")
(gp.sort_values(['app','rating'], ascending=[True, False])
 .drop_duplicates(['app']).reset_index(drop=True)
)

The top rated app versions for each app is:


Unnamed: 0,app,version,rating
0,candy-crush,1.106.0,5.0
1,facebook,238.0,5.0
2,tinder,5.4.1,5.0
3,twitter,6.19.1,5.0


### 2.3 Top words

Which word for each app is most common in the 5 star and in the 1-star review's titles?

Note: `df.title.str.get_dummies()` is your friend

Note: This might create a lot of data! Try to break down your analysis in chunks if it doesn't work.

In [150]:
tops = pd.DataFrame()
for app in app_ids.keys():
    for rating in [5., 1.]:
        tops = tops.append({
            'app': app,
            'rating': rating
        }, ignore_index=True)

def get_all_words(x, df=apps):
    filt = df[
        (df['rating'] < (x['rating'] + 0.2)) 
        & (df['rating'] > (x['rating'] - 0.2))
        & (df['app'] == x['app'])
    ]
    """
    We flatten the title column into a single string
    and measure the word frequencies
    """
    titles = pd.DataFrame(
        filt.title
            .str
            .split(expand=True)
            .stack()
            .value_counts()
    ).reset_index()
    return pd.Series({
        'top_word': titles.iloc[0]['index'],
        'freq': titles.iloc[0][0]
    })
    
    
tops = tops.merge(
    tops.apply(lambda x: get_all_words(x), axis=1),
    left_index=True,
    right_index=True
)

tops

Unnamed: 0,app,rating,top_word,freq
0,tinder,5.0,Great,15
1,tinder,1.0,for,185
2,facebook,5.0,Facebook,6
3,facebook,1.0,Censorship,108
4,twitter,5.0,Twitter,96
5,twitter,1.0,Twitter,70
6,candy-crush,5.0,Candy,187
7,candy-crush,1.0,to,46


# 3 (STRETCH) IMDB scraping

IMDB has structured web pages. We can exploit this to scrape movie data.

Usinf the following URL:

`https://www.imdb.com/search/title/?groups=top_1000&start={PAGE_NUMBER}&ref_=adv_nxt`

With the following headers in your `GET` request: `{"Accept-Language": "en-US,en;q=0.5"}`

You can generate a dataframe like this one by cycling over the page numbers in the URL requested:

![](IMDB.png)

Note that the following  page attribues will be of interest:

- `div` with a class of `lister-item mode-advanced`

- Various `span` objects within that `div` like `lister-item-year` and `runtime` and `metascore`