<a href="https://colab.research.google.com/github/dylanwalker/BA865/blob/master/BA865_Lecture_04_Exercise_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Preface

In [0]:
# imports for modules we will use:
import pandas as pd
import numpy as np
import seaborn as sns
import feather
import datetime
import pandas_datareader.data as web
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (20, 10)}) # set font and plot size


# Some code to make displaying multiple dataframes side by side better
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)


# Some code to generate example dataframes
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# Load the datasets that we'll use (IMDb movies, planets, titanic, stocks)
imdbFile = 'https://raw.githubusercontent.com/dylanwalker/BA865/master/datasets/IMDB-Movie-Data.csv'
movies_df = pd.read_csv(imdbFile, index_col="Title")
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']

planets = sns.load_dataset('planets')

titanic = sns.load_dataset('titanic')

stDate = datetime.datetime(2020,1,1)
enDate = datetime.datetime(2020,1,7)
stocks = ["AMZN","MSFT","NVDA","NTDOY", "AAPL"]
stocks_df = pd.concat([ web.DataReader(st,'yahoo',stDate,enDate).assign(Stock=st)[['Stock','Open','Close']] for st in stocks ]) # read this line from the inside out
stocks_1day_df = stocks_df.reset_index()
stocks_1day_df = stocks_1day_df[stocks_1day_df.Date==enDate].reset_index().drop(columns=['Date','index'])

# Exercise: Create a dataframe of characters from your favorite movie

Create a dataframe from scratch of characters from your favorite movie

Columns could include:
- actor name
- character gender
- A boolean that is True if the character is a villain
- ... and any other column you'd like to add

Make the index of the dataframe the character name

In [0]:
# Create your dataframe here
dieHard = {"char_name":["John McClane", "Hans Gruber", "Holly Gennaro"],
          "actor_name":["Bruce Willis", "Alan Rickman", "Bonnie Bedelia"],
          "gender": ["Male", "Male", "Female"],
          "villain": [False, True, False]}

dieHard_df = pd.DataFrame(dieHard)
dieHard_df = dieHard_df.set_index('char_name')

# Exercise: Write your Movie Character dataframe out to a csv file

Using the movie character dataframe that you created earlier, write it out to the csv file "moviechars_df.csv".


In [0]:
dieHard_df.to_csv('./moviechars_df.csv') # the index (char_name) is written as the first column

Now, open the file using Google Colab's interface to verify that it was written correctly and makes sense.
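The same check can be done in code by reading the file back in. This is a minimal sketch, assuming a small stand-in dataframe (`chars_df`) and a temporary file path, both of which are illustrative rather than part of the exercise:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the movie character dataframe
chars_df = pd.DataFrame(
    {"actor_name": ["Bruce Willis", "Alan Rickman"], "villain": [False, True]},
    index=pd.Index(["John McClane", "Hans Gruber"], name="char_name"),
)

# Write to a temporary directory so nothing in the working folder is touched
csv_path = os.path.join(tempfile.mkdtemp(), "moviechars_df.csv")
chars_df.to_csv(csv_path)

# index_col tells read_csv to restore the character name as the index
roundtrip_df = pd.read_csv(csv_path, index_col="char_name")
print(roundtrip_df)
```

Without `index_col`, the character names would come back as an ordinary column and the dataframe would get a fresh integer index.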

# Exercise: Make an Adjusted Rating Column, plot the difference

In the imdb data that we've been looking at, the rating is just the mean score. But some movies have many more votes than others, and this should lend more "weight" to their rating.  

A good adjusted scoring rule is:
```
 rating_adjusted = rating - (rating - 5)*2**(-log10(votes+1))
```
(note: to implement this, we can use `np.log10()` )
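To get a feel for the rule, it may help to try it on single numbers first. The sketch below wraps the formula in a helper function (`adjust_rating` is an illustrative name, not something defined elsewhere in this notebook):

```python
import numpy as np

def adjust_rating(rating, votes):
    """Shrink a rating toward 5; the shrinkage fades as votes grow."""
    return rating - (rating - 5) * 2 ** (-np.log10(votes + 1))

# log10(999 + 1) = 3, so the shrink factor is 2**-3 = 0.125
# and the adjusted rating is 8 - 3 * 0.125 = 7.625
print(adjust_rating(8.0, 999))

# log10(0 + 1) = 0, so the shrink factor is 1
# and a movie with no votes is pulled all the way to 5
print(adjust_rating(8.0, 0))
```

The more votes a movie has, the smaller the shrink factor, so heavily voted movies keep a rating close to their raw mean.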

Make a new column called 'rating_adjusted' to implement this. Then, make a scatter plot of (rating_adjusted - rating) vs  rating.

note: it's okay if you want to make a column `rating_delta = rating_adjusted - rating`.

Now calculate the `rating_delta` and `rating_adjusted` and make a scatterplot with `rating` on the x-axis and `rating_delta` or `rating_adjusted` on the y-axis.

## Solution: Don't look at this until you've tried it! (You might have to do this on the final!)

In [0]:
# Write your code here
movies_df.head()

# Solution using apply:
movies_df['rating_adjusted'] = movies_df.apply(lambda row: row['rating']-(row['rating']-5)*2**(-np.log10(row['votes']+1)),axis=1)

# Solution w/o using apply (better):
movies_df['rating_adjusted'] = movies_df.rating - (movies_df.rating-5)*2**(-np.log10(movies_df.votes+1))
movies_df['rating_delta'] = movies_df.rating_adjusted - movies_df.rating
movies_df.plot(kind='scatter',x='rating',y='rating_delta',figsize=(20,10));

# Exercise: Plot of Revenue vs Adjusted Rating for only one Genre



Using the `movies_df` DataFrame, write a function that will plot the scatterplot of Revenue vs Adjusted Rating for only one Genre.

Your function should:
- have an input argument that is a string of the Genre, e.g., 'Horror'


Note that an entry in the genre column contains a comma-separated list of the different genres that a movie belongs to:

In [0]:
movies_df.genre[1:10]

However, we can get all the individual unique genres by using pandas' built-in string operations on a Series:

In [0]:
import numpy as np
allGenresConcatenated = movies_df.genre.str.cat(sep=',') # This will return a string by concatenating all the strings in each row of genre, separating them with a ',' 
allGenres=np.unique(allGenresConcatenated.split(',')) # This will split the string so that we have a list and then use numpy's unique() to get only the unique elements of the list
allGenres

You may find the following string method of a pandas Series useful:
- If a dataframe `df` has a string column, `stringCol`, then
  - `df.stringCol.str.contains(someString)` returns a boolean Series that is `True` for each row where `someString` is a substring of the value in `stringCol`.
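As a rough illustration of how `.str.contains()` can be used to filter rows, here is a sketch on a small made-up dataframe (`toy_df` and its movie names are hypothetical, chosen only to mirror the comma-separated genre layout):

```python
import pandas as pd

# Hypothetical toy data with the same comma-separated genre layout
toy_df = pd.DataFrame(
    {"genre": ["Action,Sci-Fi", "Horror,Thriller", "Action,Comedy"]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Boolean Series: True where 'Action' appears anywhere in the genre string
is_action = toy_df.genre.str.contains("Action")

# Use the boolean Series as a mask to keep only the matching rows
action_df = toy_df[is_action]
print(action_df)
```

By default the argument is treated as a regular expression; passing `regex=False` makes it a literal substring match.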

Your goal is to define a function that will return a plot object. The function should make a scatter plot of `rating_adjusted` on the x-axis and `revenue_millions` on the y-axis.



In [0]:
# Define your function here
def plot_rev_vs_rating_adj(genreName):
  plot = # fill in your code here with plot = movies_df[SOMETHING].plot(SOMETHING)
  plot.set_title(genreName) 
  return plot

# Run your function for the genres 'Horror' and 'Action'
plot = plot_rev_vs_rating_adj('Horror')
plot = plot_rev_vs_rating_adj('Action')

## Solution: Don't look at this until you've tried it!

In [0]:
def plot_rev_vs_rating_adj(genreName):
  plot = movies_df[movies_df.genre.str.contains(genreName)].plot(kind='scatter',x='rating_adjusted',y='revenue_millions')
  plot.set_title(genreName)
  return plot

plot=plot_rev_vs_rating_adj('Horror')
plot=plot_rev_vs_rating_adj('Action')

It would be nice if we could make a boxplot of the distribution of revenue across all genres in the same plot... we'll come back to this later, when we've learned some more tools to help us do this.

# Exercise: Cars, Cars, Cars ( but no motorcycles :[ )    

We'll test our knowledge of merging and concatenating by working with some datasets on cars.

In [0]:
cars1 = pd.read_csv("https://raw.githubusercontent.com/dylanwalker/BA865/master/datasets/cars1.csv")
cars2_engine = pd.read_csv("https://raw.githubusercontent.com/dylanwalker/BA865/master/datasets/cars2_engine.csv")
cars2_perf = pd.read_csv("https://raw.githubusercontent.com/dylanwalker/BA865/master/datasets/cars2_perf.csv")
cars2_info = pd.read_csv("https://raw.githubusercontent.com/dylanwalker/BA865/master/datasets/cars2_info.csv")

display('cars1','cars2_engine','cars2_perf','cars2_info')

Q1: The first dataset `cars1` has some blank columns. Get rid of them:

In [0]:
# write your code here
cars1 = cars1.loc[:, "mpg":"car"] # using .loc to slice only the columns we want to keep

Q2: Look at the number of observations in each of the datasets (cars1, cars2_perf, cars2_engine, cars2_info).  Do any of the datasets contain duplicate data? If so, clean them.

In [0]:
# write your code here
cars1.drop_duplicates(inplace=True)
cars2_perf.drop_duplicates(inplace=True)
cars2_engine.drop_duplicates(inplace=True)
cars2_info.drop_duplicates(inplace=True)
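Before dropping anything, it can be useful to count how many duplicate rows a dataframe actually contains. A minimal sketch on a made-up frame (`toy_cars` is hypothetical, not one of the course datasets):

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy_cars = pd.DataFrame(
    {"car": ["civic", "golf", "civic"], "mpg": [33, 29, 33]}
)

n_rows = len(toy_cars)
# duplicated() flags every row that is identical to an earlier row
n_dupes = toy_cars.duplicated().sum()
cleaned = toy_cars.drop_duplicates()

print(n_rows, n_dupes, len(cleaned))
```

If `n_dupes` is zero for a dataset, `drop_duplicates()` leaves it unchanged, so calling it on all four frames is harmless.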

Q3: Combine the data in cars2_engine, cars2_perf, and cars2_info into a single dataframe called cars2:

In [0]:
# write your code here
cars2=pd.merge(cars2_engine,cars2_perf)
cars2=pd.merge(cars2,cars2_info)
cars2
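The two `pd.merge` calls above do not name a key, so pandas joins on whichever column names the two frames have in common. The sketch below shows the same kind of merge with the key stated explicitly; the frames `engine` and `perf` and their values are made up for illustration and are not the course data:

```python
import pandas as pd

# Hypothetical toy frames that share the key column 'car'
engine = pd.DataFrame({"car": ["civic", "golf"], "cylinders": [4, 4]})
perf = pd.DataFrame({"car": ["golf", "civic"], "mpg": [29, 33]})

# Naming the key documents the join and guards against accidental matches
# on other columns that happen to share a name
merged = pd.merge(engine, perf, on="car", how="inner")
print(merged)
```

Row order in the two inputs does not need to match, because rows are paired by the value of the key.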

Q4: Get rid of the extra unnamed column in cars2 and then combine the data in cars1 and cars2 together.

In [0]:
# write your code here
cars2=cars2.loc[:,"car":"origin"] # get rid of the unnamed column in cars2
cars=pd.concat([cars1,cars2],sort=True) # concatenate the cars1 and cars2 dataframes together
cars

# Exercise: International Alcohol Consumption

In [0]:

drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv')
drinks.head()

Q1: Which continent drinks more beer on average?

In [0]:
# code your answer here
drinks.groupby('continent').beer_servings.mean().sort_values(ascending=False) # group by continent (not country); the top row is the answer

Q2: For each continent, print the statistics for wine consumption:

In [0]:
# code your answer here
drinks.groupby('continent').wine_servings.describe()

Q3: What is the mean alcohol consumption for each continent?

In [0]:
# code your answer here
drinks.groupby('continent').total_litres_of_pure_alcohol.mean()

Q4: Using only one line of code, compute the mean, min and max spirit consumption per continent.

In [0]:
# code your answer here
drinks.groupby('continent').spirit_servings.agg(['mean','min', 'max']).sort_values(['mean'],ascending=False) # the .sort_values() part is not necessary

# Exercise: Does the number of planets detected by each method change over the years?  

We want to know how the number of planets detected by each method changes over the years.
<br>
<br>
What we're after is a dataframe where the index is a MultiIndex of (year,method) (where method is e.g., Radial Velocity, Pulsar Timing, etc.) and there is a column (that we'll name 'number') for the count for each year and method.
<br>
<br>
Once the dataframe is structured like this, we can just call `.plot()`. If the figure is not sized correctly, try adding the keyword argument `figsize=(width,height)`, replacing `width` and `height` with integers, e.g. `figsize=(20,10)`.
<br>
<br>
Some hints:
- you'll need to group by more than one column to get the MultiIndex
- the order of the columns in the groupby matters, because we'll need to use `unstack()` to turn the detection methods into column labels.
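To see what grouping by two columns and then calling `unstack()` does, here is a sketch on unrelated made-up data (the `sightings` frame and its `kind` column are hypothetical, so this does not give away the solution):

```python
import pandas as pd

# Hypothetical toy observations: one row per sighting
sightings = pd.DataFrame(
    {
        "year": [2001, 2001, 2001, 2002, 2002],
        "kind": ["bird", "bird", "fox", "bird", "fox"],
    }
)

# Group by (year, kind): the result is a Series with a two-level MultiIndex
counts = sightings.groupby(["year", "kind"]).size()

# unstack() moves the innermost index level ('kind') into the columns
wide = counts.unstack()
print(wide)
```

Because `unstack()` pivots the last index level by default, the column you want as plot lines should come last in the `groupby` list.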

In [0]:
# Enter your code here

## Solution: Don't look until you've tried it!

In [0]:
pmot_df=planets.groupby(['year','method'])['number'].count() # this produces a Series with a (year, method) MultiIndex whose values are the counts
pmot_df.unstack().plot(); # This will plot a series for each column vs the index (year)

# Exercise: Make a boxplot of movie revenue by each genre



For this exercise, we'll use `movies_df`, containing imdb movie data.  What we'd like to do is make a boxplot with a box for each genre to show how the revenue is distributed for movies that are part of that genre.
<br>
<br>
Recall how the imdb movies data handles the fact that a movie could belong to one or more genres using a comma-separated string:


In [0]:
movies_df[['genre']]

To make our boxplot, we need to get this dataframe into "long" format.  If the genre field were a list instead of a string, we would be in a similar situation as with the pets example. So how can we get it to become a list?  

Pandas has some cool methods that let you work with columns that are strings.

In fact, many are vectorized versions of Python's regular string methods:


|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

But they also have methods for pattern matching with regular expressions:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning matched groups as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |
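A few of these methods in action, as a minimal sketch on a made-up Series (`genres` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy Series of comma-separated genre strings
genres = pd.Series(["Action,Sci-Fi", "Horror", "Action,Adventure,Comedy"])

# count(): number of commas in each element
n_commas = genres.str.count(",")

# extract(): first capture group; expand=False returns a Series
first_genre = genres.str.extract(r"^([^,]+)", expand=False)

# match(): True where the pattern matches at the start of the string
starts_with_a = genres.str.match(r"A")

print(n_commas.tolist())
print(first_genre.tolist())
print(starts_with_a.tolist())
```

Each call returns a new Series aligned with the original index, so the results can be assigned straight back as new columns.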

I mention these not only because they are very useful when working with text data, but also because we can use them in this exercise to take the genre, which is a string of comma-separated genre names, and convert it into a list.

Here's how:


In [0]:
movies_df.genre.str.split(',')

We can actually make this a new column of our dataframe. Let's call it "genre_list":

In [0]:
movies_df['genre_list']=movies_df.genre.str.split(',')
movies_df

Ok, now you're ready to go!  In order to make the boxplot, you'll need to:
- use `.explode()` to get the genre_list into long format.
- use `.boxplot()` with the appropriate `column` and `by` keyword arguments.
- I'd also suggest adding a `figsize=(40,8)` argument to `.boxplot()` because this figure will need to be fairly wide.
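As a reminder of what `.explode()` does to a list column, here is a small sketch on made-up data (the `owners` frame is hypothetical and unrelated to the movies data):

```python
import pandas as pd

# Hypothetical toy data: each owner has a list of pets
owners = pd.DataFrame(
    {"owner": ["Ann", "Bob"], "pets": [["cat", "dog"], ["fish"]]}
)

# explode() emits one row per list element and repeats the other columns
long_df = owners.explode("pets")
print(long_df)
```

The exploded rows keep the index label of the row they came from, so an owner with two pets appears twice with the same index.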

In [0]:
# Input your code here

## Solution -- Don't look until you've tried it!

In [0]:
# SOLUTION: 
mdf = movies_df.copy()
mdf['genreList']=movies_df.genre.apply(lambda x: x.split(',')) # make a genre column that is a list instead of a comma-separated string
mdf['genreList']=movies_df.genre.str.split(',') # this does the same thing with a string method (either line alone is enough)

# Now we can use the 'explode' method which will take a list column and turn it into a bunch of rows
mdf[['revenue_millions','rating','genreList']].explode('genreList').boxplot(column='revenue_millions',by='genreList',figsize=(40,8));