# EXERCISES: CHAPTER 6
## Data Loading, Storage, and File Formats

In [None]:
import numpy as np
import pandas as pd

## Exercise 0: Generate a random CSV file and read it in

### Preparation
0. Go to this site: https://mockaroo.com/
1. Try playing around with picking different types of mock data if you want
2. Make sure the 'Format' is set to CSV (should be the default).
3. Scroll down a bit, click 'Download Data', and save to wherever you want.

### Task 0
Read in the mocked data as a data frame.

### Solution 0

In [None]:
# 0. Copy the file path to your file (if you don't know how to do this quickly, search for instructions for your operating system) 
# 1. Paste your file path to a Python script like this one, and run it.
fpath = '/Users/lowe/Downloads/MOCK_DATA.csv'
df = pd.read_csv(fpath)
df

### Task 1
Save the produced dataframe's contents to a new .csv file, using `;` as the field delimiter.

### Solution 1

In [None]:
df.to_csv('foobar.csv', sep=';')
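
To check that the file was written the way you intended, you can read it back with the same delimiter. The sketch below uses a small made-up data frame and an in-memory buffer (`io.StringIO`) instead of a file on disk; both are stand-ins for illustration, not part of the exercise data.

```python
import io

import pandas as pd

# small stand-in for the mock data (made-up values)
df_demo = pd.DataFrame({'id': [1, 2, 3], 'first_name': ['Ann', 'Bo', 'Cy']})

# write to an in-memory buffer instead of a file on disk;
# index=False leaves out the row labels (0, 1, 2, ...)
buffer = io.StringIO()
df_demo.to_csv(buffer, sep=';', index=False)
csv_text = buffer.getvalue()

# read it back, telling pandas which delimiter was used
df_roundtrip = pd.read_csv(io.StringIO(csv_text), sep=';')
print(csv_text)
print(df_roundtrip.equals(df_demo))
```

Without `index=False`, `to_csv` also writes the row labels as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back.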

## Exercise 1: Copy an HTML table from a webpage

### Preparation
0. Go to this Wikipedia page: https://en.wikipedia.org/wiki/List_of_C-family_programming_languages
1. Scroll down to the table of different programming languages

### Task
Copy the header and the first five rows (down to the 'C++' row) from the web page's table into a data frame.

### Solution

0. Highlight the header and the first five rows with your mouse.
1. Use Ctrl+C (Cmd+C on macOS) to copy the highlighted text to your clipboard.
2. Go to your Jupyter notebook and run the code below.

In [None]:
import pandas as pd
df = pd.read_clipboard()
df
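
`read_clipboard` needs access to the system clipboard, which is not always available (for example on a remote Jupyter server). A possible fallback, sketched here, is to paste the copied text into a string and parse that. Text copied from a web table is typically tab-separated, which is what the sketch assumes; the rows are made-up placeholders, not the contents of the Wikipedia table.

```python
import io

import pandas as pd

# placeholder text standing in for what you pasted from the clipboard;
# columns are assumed to be separated by tabs
pasted_text = (
    "Language\tYear\n"
    "Alpha\t1970\n"
    "Beta\t1980\n"
)

# parse the pasted text as tab-separated values
df_pasted = pd.read_csv(io.StringIO(pasted_text), sep='\t')
print(df_pasted)
```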

## Exercise 2: Convert a Python list of dictionaries to JSON

### Preparation
Run the code snippet below to create a list of dictionaries.

In [None]:
people = [
    {'name': 'Breonna', 'age': 26, 'profession': 'Nurse'},
    {'name': 'Ada', 'age': 36, 'profession': 'Programmer', 'Interests': ['Poetry', 'Poetical Science']},
    {'name': 'Charles', 'age': 79, 'profession': 'Programmer'}
]

### Task

Convert the data in `people` to a string of JSON-formatted data.

### Solution

In [None]:
import json
people_json = json.dumps(people)
people_json
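
Two things are worth checking here: that the result is a plain string, and that converting it back gives the original data. The sketch below repeats the `people` list so that it runs on its own, and uses the optional `indent` argument of `json.dumps` for more readable output.

```python
import json

people = [
    {'name': 'Breonna', 'age': 26, 'profession': 'Nurse'},
    {'name': 'Ada', 'age': 36, 'profession': 'Programmer', 'Interests': ['Poetry', 'Poetical Science']},
    {'name': 'Charles', 'age': 79, 'profession': 'Programmer'}
]

# indent=2 spreads the JSON over several lines
pretty_json = json.dumps(people, indent=2)
print(type(pretty_json))
print(pretty_json)

# json.loads converts the JSON string back to Python lists and dictionaries
people_again = json.loads(pretty_json)
print(people_again == people)
```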

## Exercise 3: Copy an HTML table from a webpage, again

### Preparation

0. Install (`pip install`/`conda install`) the packages `lxml`, `beautifulsoup4` and `html5lib`, if you don't already have them
1. Go to https://c19.se/ , which has COVID-19 statistics for Sweden
2. Scroll down to where you see a table of by-region statistics (actually two tables, since the summary statistics are in an HTML table element of their own)

### Task

Convert/import the by-region HTML table to a pandas dataframe and sort it by total number of cases (the 'Fall' column).

### Solution

In [None]:
# 0. Right click the webpage in your web browser and choose 'Save Page As...'
# 1. Copy the file path to the downloaded .html file.
# 2. Paste the file path into a code snippet like this one.
fpath = '/Users/lowe/Desktop/C19.SE - Coronavirus i Sverige.html'
# let pandas parse the html, creating a list of dataframes
covid_frames = pd.read_html(fpath)

In [None]:
# check how many tables were found (that is, how many data frames were created)
len(covid_frames)

In [None]:
# take a look at the first dataframe
covid_frames[0]

In [None]:
# that's what we want, so make a copy that we can edit without
# changing the original data (in case we want to start over later)
covid_regions_df = covid_frames[0].copy()

In [None]:
# check the data type of the 'Fall' (total number of cases) column
covid_regions_df['Fall'].dtype

In [None]:
# the 'Fall' column isn't reliably interpreted as being of integer type
# (so you might get `dtype('O')`, i.e. 'object', above), which is a problem since
# we want to sort by numeric value. in that case, you need to clean and convert
# the data using something like this code snippet.

# remove all whitespace, e. g. '2 830'->'2830'
covid_regions_df['Fall'] = covid_regions_df['Fall'].str.replace(' ', '')
# convert to integer type
covid_regions_df['Fall'] = covid_regions_df['Fall'].astype(int)

In [None]:
# sort by total number of cases
covid_regions_df.sort_values('Fall', inplace=True)
covid_regions_df
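
The cleaning step can be tried out without the saved web page. The sketch below uses made-up region names and numbers (not real statistics). It removes both ordinary and non-breaking spaces, since thousands separators on web pages are sometimes non-breaking spaces, and it sorts in descending order, which is an optional variation on the solution above.

```python
import pandas as pd

# made-up stand-in for the by-region table; '\xa0' is a non-breaking space
demo_df = pd.DataFrame({
    'Region': ['A', 'B', 'C'],
    'Fall': ['2 830', '12\xa0045', '973'],
})

# remove ordinary and non-breaking spaces, then convert to integers
fall_clean = demo_df['Fall'].str.replace(' ', '', regex=False)
fall_clean = fall_clean.str.replace('\xa0', '', regex=False)
demo_df['Fall'] = fall_clean.astype(int)

# sort so that the region with the most cases comes first
demo_sorted = demo_df.sort_values('Fall', ascending=False)
print(demo_sorted)
```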

## Exercise 4: Fetch JSON data from an API

### Preparation

API stands for **Application Programming Interface**. It's a broad term that roughly describes *an interface, a set of commands, that can be used to make some piece of technology and/or software do things*. In the context of databases, an API means an interface that allows you to retrieve or manipulate the contents of a database. For a simple example, say you have a database about ice creams, which holds each ice cream's name and price. You could then let outside users connect to your database and tell it to send them a list of all the ice creams in your database, or to delete all the ice creams in the database. You could call these two commands "GET ALL ICECREAM" and "DELETE ALL ICECREAM". The users don't know how exactly your database actually executes the commands, only how to give them. That's an API consisting of a set of two instructions. Obviously it has its drawbacks, and APIs in the wild usually offer more comprehensive and refined sets of commands, as we'll see.
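
The ice cream example can be sketched in a few lines of Python. The class, its method names and the ice creams in it are all made up for illustration. The point is that users only call the two commands and never touch the internal storage directly.

```python
class IceCreamDatabase:
    """Toy database with an API of exactly two commands."""

    def __init__(self):
        # internal storage; users of the API never touch this directly
        self._rows = [
            {'name': 'Vanilla', 'price': 2.5},
            {'name': 'Chocolate', 'price': 3.0},
        ]

    def get_all_icecream(self):
        """Corresponds to the 'GET ALL ICECREAM' command."""
        # return copies so that callers cannot change the stored rows
        return [dict(row) for row in self._rows]

    def delete_all_icecream(self):
        """Corresponds to the 'DELETE ALL ICECREAM' command."""
        self._rows.clear()


db = IceCreamDatabase()
before = db.get_all_icecream()
db.delete_all_icecream()
after = db.get_all_icecream()
print(len(before), len(after))
```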

In this exercise we interact with the [Open Movie Database](https://www.omdbapi.com/). Its API is relatively beginner-friendly and has been stable for a long time, so it's easy to get started. To use it you need to get an *API key* (essentially a 'username' and 'password' in a single code). Luckily anyone can register for a key and all you need is an e-mail address. Go to [this page](https://www.omdbapi.com/apikey.aspx) to register, choosing the 'free' option.  Check your e-mail inbox for a message with your API key, then briefly read through the [Usage](https://www.omdbapi.com/#usage) and [Parameters](https://www.omdbapi.com/#parameters) sections. Finally, try a couple of [Examples](https://www.omdbapi.com/#examples).

If you don't have it yet (if you have Anaconda you're good to go), `pip install` the **requests** package.

*(Note: fetching data about movies might not feel very 'data science-y', of course. The point here is simply to get a feel for how interacting with an API works, because this in itself can become quite complicated.)*

### Task 0

Retrieve data for all movies from 1951 with 'cat' in the title and put them in a pandas dataframe. (1951 is chosen simply because it makes for a very manageable dataset of a few, rather than hundreds of, movies)

### Solution 0

1. Use the web site's 'examples' form to do a search with 'Title' set to 'cat' and 'Year' set to '1951'. You'll see that the request URL looks like this: 'http://www.omdbapi.com/?t=cat&y=1951'. This is a URL ('Uniform Resource Locator', very roughly an 'internet address') that applies the filters that we want. The most important parts are the parameters at the end, 't=cat&y=1951', basically meaning 'title must include cat, and year must be 1951'.
2. Look at the single-record response shown below the search form to see how each movie's data are formatted. Each record is structured as a 'dictionary', like this:
> {"Title":"Cat Napping","Year":"1951","Rated":"Approved","Released":"08 Dec 1951","Runtime":"7 min","Genre":"Animation, Short, Comedy, Family","Director":"Joseph Barbera, William Hanna","Writer":"N/A","Actors":"N/A","Plot":"Tom's getting ready to settle into the hammock, but Jerry has beat him to it and the battle begins.","Language":"English","Country":"USA","Awards":"N/A","Poster":"https://m.media-amazon.com/images/M/MV5BZTYwMGY0NGMtOTQzMy00YjVmLTk5YmEtZjZiNjRiM2VhNzVkXkEyXkFqcGdeQXVyNjMxODMyODU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.5/10"}],"Metascore":"N/A","imdbRating":"7.5","imdbVotes":"643","imdbID":"tt0043387","Type":"movie","DVD":"N/A","BoxOffice":"N/A","Production":"N/A","Website":"N/A","Response":"True"}

This format is called the 'records' format by pandas, as can be seen in the [documentation for `read_json`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html).

3. Note that the examples form uses the 't' parameter, which means that the response will only include a **single** movie. We however want **all** movies from 1951 that have 'cat' in their titles. If we look at the [Parameters](https://www.omdbapi.com/#parameters) section again we see that to do a proper search (that can return multiple movies) we need to use the 's' parameter instead of 't'.

4. Based on the above, we know that we want to send a GET request (made with the `get` function of the 'requests' package) to 'http://www.omdbapi.com'. The request should have the following parameters:

* apikey: --API key sent in e-mail--
* s: cat
* y: 1951

5. Use the following code, which makes use of the information you collected in the previous steps.

In [109]:
# import the requests package and the JSON package
import requests
import json

# specify the address to the API site (the 'API endpoint')
api_endpoint = 'http://www.omdbapi.com'
# form a dictionary of request parameters
url_parameters = {
    'apikey': 'INSERTAPIKEYHERE',
    's': 'cat',
    'y': 1951
}

# use the `requests` package's `get` function to make a
# GET ('retrieve data') request to the API endpoint, 
# including the request parameters.
# this produces a Response object, which includes
# a lot of information about the API's response
requests_resp = requests.get(api_endpoint, params=url_parameters)
type(requests_resp)

requests.models.Response
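
To see how the dictionary of parameters ends up in the request URL, you can build and take apart the query string yourself. This sketch uses only the standard library's `urllib.parse` module and makes no network request; the API key is the same placeholder as above.

```python
from urllib.parse import parse_qs, urlencode, urlparse

api_endpoint = 'http://www.omdbapi.com'
url_parameters = {
    'apikey': 'INSERTAPIKEYHERE',
    's': 'cat',
    'y': 1951
}

# turn the dictionary into a query string and append it to the endpoint
query_string = urlencode(url_parameters)
full_url = api_endpoint + '/?' + query_string
print(full_url)

# going the other way: split a URL into its parameters
parsed_parameters = parse_qs(urlparse(full_url).query)
print(parsed_parameters)
```

After a real request, `requests_resp.url` shows the URL that `requests` actually used, which is a quick way to check that the parameters were sent as intended.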

In [None]:
# we are only interested in the data sent back by the API,
# which are in the Response object's 'content' attribute
raw_json = requests_resp.content
# let's see how the actual search query response is formatted
raw_json[:400]

In [None]:
# the movie data are inside of a 'list' (array) linked to the 'Search' key.
# this means that giving the data directly to pandas, only specifying
# the 'records' format, won't work.
# instead we can use Python's json package to convert from JSON
# to a python dictionary, extract the data as a list, and hand it
# off to pandas. 

# first, we use the json package's
# `json.loads` (load string) function to
# convert from JSON to dictionary
data_dict = json.loads(raw_json)
data_dict

In [None]:
# we now extract the list of data
data_ls = data_dict['Search']

In [None]:
# finally we pass the list of data to pandas' DataFrame constructor
movie_cat1951_df = pd.DataFrame(data_ls)

In [None]:
# now all movies from 1951 with 'cat' in the title (that are in the
# open movie database) are in the dataframe
movie_cat1951_df
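
If you don't have an API key yet, the parsing steps can still be tried on a hand-written response. The JSON string below is a shortened, illustrative stand-in: it assumes the search response has a 'Search' key holding a list of records, as described above, and reuses the title, year and ID from the example record; the remaining keys are assumptions about the response's shape. The sketch also uses `dict.get`, so that a response without a 'Search' key produces an empty data frame instead of a `KeyError`.

```python
import json

import pandas as pd

# hand-written stand-in for the API's response (not fetched from the API)
sample_json = '''
{
  "Search": [
    {"Title": "Cat Napping", "Year": "1951", "imdbID": "tt0043387", "Type": "movie"}
  ],
  "Response": "True"
}
'''

sample_dict = json.loads(sample_json)
# fall back to an empty list if there is no 'Search' key
sample_ls = sample_dict.get('Search', [])
sample_df = pd.DataFrame(sample_ls)
print(sample_df)

# a response without search results gives an empty data frame
no_results_dict = {"Response": "False"}
empty_df = pd.DataFrame(no_results_dict.get('Search', []))
print(len(empty_df))
```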

### Task 1

Figure out and compare the number of movies from 1951 with 'dog' in the title with the number of 1951 movies with 'cat' in the title.

### Solution 1

In [None]:
url_parameters = {
    'apikey': 'INSERTAPIKEYHERE',
    's': 'dog', # NEW
    'y': 1951
}

requests_resp = requests.get(api_endpoint, params=url_parameters)
raw_json = requests_resp.content
data_dict = json.loads(raw_json)
data_ls = data_dict['Search']

movie_dog1951_df = pd.DataFrame(data_ls) # NEW
movie_dog1951_df

In [None]:
# get the number of rows in each data frame to figure out the number of
# movies in each category
num_dog_movies = len(movie_dog1951_df)
num_cat_movies = len(movie_cat1951_df)
# compare the numbers to see what the correct conclusion is
if num_dog_movies > num_cat_movies:
    conclusion = "there were more dog than cat movies"
elif num_dog_movies < num_cat_movies:
    conclusion = "there were more cat than dog movies"
else:
    conclusion = "there were as many cat as dog movies"
# print the results
print(f"There were {num_dog_movies} dog movies and {num_cat_movies} cat movies in 1951, meaning that {conclusion}")
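
One caveat: `len` counts only the records contained in a single response. OMDb's Parameters section lists a `page` parameter, which suggests that long result lists are split over several pages; in that case the row count would understate the number of matches. The sketch below assumes that the response carries a 'totalResults' field with the overall count (check a real response to confirm this) and falls back to counting the records when the field is missing. The function name and the dictionaries are hypothetical, hand-written stand-ins.

```python
def count_matches(data_dict):
    """Return the number of matches reported in a search response.

    Assumes an optional 'totalResults' field (an assumption, see above);
    otherwise counts the records under 'Search'.
    """
    if 'totalResults' in data_dict:
        return int(data_dict['totalResults'])
    return len(data_dict.get('Search', []))


# hand-written stand-ins for two responses (made-up contents)
cat_response = {'Search': [{'Title': 'A'}, {'Title': 'B'}], 'totalResults': '12'}
dog_response = {'Search': [{'Title': 'C'}]}

num_cat = count_matches(cat_response)
num_dog = count_matches(dog_response)
print(f"{num_dog} dog movies and {num_cat} cat movies")
```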