# Phase 1 Code Challenge Review

![let's do this](https://media.giphy.com/media/12WPxqBJAwOuIM/giphy.gif)

The topics covered will be:

  - [Interacting with Pandas dataframes](#dataframes)
  - [Visualization](#viz)
  - [Python Data Structures](#datastructures)
    

In [22]:
# My function to call on y'all
from src.call import call_on_students


<a id='dataframes'></a>
# Part 1: Interacting with Pandas DataFrames

To practice working with dataframes, we will use some Facebook data taken from the UCI Machine Learning repository.

Refer to this paper if you are interested in learning more. There is also a nice description of the features: http://www.math-evry.cnrs.fr/_media/members/aguilloux/enseignements/m1mint/moro2016.pdf



In [None]:
# Before anything else - need to import pandas!
import pandas as pd

## Task 1: Read in the data

In [None]:
# Can explore the data folder within your notebook
!ls data

Read 'dataset_Facebook.csv' from the data folder into the notebook as a Pandas dataframe.

Note: we'll need to set a different delimiter here - let's read the data in, then explain what that means.

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook = pd.read_csv('data/dataset_Facebook.csv', delimiter=';')
facebook.head()
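A delimiter is the character that separates one field from the next on each line of the file. `read_csv` assumes a comma by default, so a semicolon-separated file read without `delimiter=';'` (or its alias `sep=';'`) collapses into a single column. A minimal sketch, using a made-up two-row sample rather than the real dataset:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample (not the real Facebook data)
raw = "Type;like;share\nPhoto;79;17\nStatus;130;29\n"

# Default comma delimiter: every line is parsed as one wide field
wrong = pd.read_csv(io.StringIO(raw))

# Semicolon delimiter: fields are split as intended
right = pd.read_csv(io.StringIO(raw), delimiter=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```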

## Task 2: Explore the data

### 2a: Look at the first five rows of the dataframe, then the last ten rows

In [23]:
call_on_students(1)


In [None]:
# Your code here
facebook.head()

In [None]:
facebook.tail(10)

### 2b: Look at the information for each column in the dataframe, then describe what you notice

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook.info()

What do you notice?

- 500 rows (entries)
- 19 columns (indexed 0 through 18)
- very small amount of null values
- we have one column that is object (string)


### 2c: Describe the dataframe's numeric columns, then describe what you notice
 

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook.describe()

In [None]:
facebook['Type'].value_counts()

In [None]:
facebook.describe(include=['object'])

What do you notice?

- what is category number?
- weekday could be name
- paid is boolean (1 or 0)
- very different numerical scales (relevant later)
- we have extreme values (max way higher than 75%)

## Task 3: Explore null values

### 3a: Count how many null values there are in each column
 

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook.isna().sum()

### 3b: What are some of the things we could do to deal with null values?
 

In [None]:
call_on_students(1)

Answer:

- drop nulls, depending on how many
    - drop the rows that have nulls
    - drop a column that has nulls
- missing indicator with boolean (yes, no)
- fill in with some measure of central tendency
- fill in with new category ('Unknown' or 'N/A')
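The options in the list above can be sketched on a toy frame. The values below are made up for illustration, and which option is appropriate depends on how many nulls there are and what the column means:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, used only to illustrate the options
df = pd.DataFrame({
    'like': [10.0, np.nan, 30.0, 50.0],
    'Type': ['Photo', None, 'Link', 'Photo'],
})

# Option 1: drop rows that contain a null in a chosen column
dropped_rows = df.dropna(subset=['like'])

# Option 2: add a boolean missing-indicator column
flagged = df.assign(like_missing=df['like'].isna())

# Option 3: fill a numeric column with a measure of central tendency
median_filled = df['like'].fillna(df['like'].median())

# Option 4: fill a categorical column with a new category
type_filled = df['Type'].fillna('Unknown')
```

None of these calls modify `df` itself; each returns a new object, which is why the result is assigned to a name.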

### 3c: Drop records that have null values in the `share` column

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook.dropna(subset=['share'], inplace=True)

# Other way
# facebook = facebook.dropna(subset=['share'])

In [None]:
facebook.isna().sum()

## Task 4: Create a column

An "impression" counts each time a post is displayed.  

Create a new column called `likes_per_impression` which divides the number of likes per post by the number of impressions per post.

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook["likes_per_impression"] = facebook['like'] / facebook['Lifetime Post Total Impressions']
facebook.head()

## Task 5: Multiply `likes_per_impression` by 100 so they look like percentages

Make a new column for this, `likes_per_impression_perc`, to capture this output.

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook['likes_per_impression_perc'] = facebook['likes_per_impression'] * 100
facebook.head()
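Both of the new columns rely on element-wise arithmetic: dividing one column by another pairs up the rows, and multiplying by a number applies to every row. A small sketch with made-up numbers, where `impressions` is a shortened stand-in for the real column name:

```python
import pandas as pd

# Made-up numbers; the column names here are shortened stand-ins
posts = pd.DataFrame({'like': [5, 20, 0], 'impressions': [100, 400, 50]})

# Dividing two columns is element-wise: row i of one by row i of the other
posts['likes_per_impression'] = posts['like'] / posts['impressions']

# Multiplying by a scalar broadcasts to every row
posts['likes_per_impression_perc'] = posts['likes_per_impression'] * 100

print(posts)
```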

## Task 6: How many examples of each type of post?

I'm seeing a lot of posts with the type of 'Photo' - are there other post types, and what's the breakdown?

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook['Type'].value_counts()

In [None]:
# Can also look at the percentages
facebook['Type'].value_counts(normalize=True)

In [None]:
facebook['Category'].value_counts()

In [None]:
facebook.groupby(by='Type').count()

## Task 7: Find the five Photos with the highest likes/impression ratio

Find the five **Photos** with the highest number in our new `likes_per_impression_perc` column

In [None]:
call_on_students(1)

In [None]:
# Your code here
top_5 = (facebook.loc[facebook['Type'] == "Photo"]
                 .sort_values(by='likes_per_impression_perc', ascending=False)
                 .head())

In [None]:
top_5
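As an alternative to sorting and then calling `head`, pandas also offers `DataFrame.nlargest`, which does both in one step. A sketch on made-up rows, keeping the top two rather than the top five to stay short:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    'Type': ['Photo', 'Status', 'Photo', 'Photo', 'Link'],
    'likes_per_impression_perc': [1.2, 9.9, 3.4, 0.5, 2.0],
})

# Filter to Photos, then keep the two rows with the largest values
top_2 = toy.loc[toy['Type'] == 'Photo'].nlargest(2, 'likes_per_impression_perc')
print(top_2)
```

The Status row has the highest value overall, but it is excluded because the filter runs before `nlargest`.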

In [None]:
# Raises a KeyError: there is no column named 'Daniel'
facebook['Daniel']

## Task 8: Find the most liked Photo

Locate the **Photo** that has the largest amount of likes in the `like` column

In [None]:
call_on_students(1)

In [None]:
# Your code here
most_liked_photo = facebook.loc[facebook['Type'] == "Photo"].sort_values(by='like',
                                                      ascending=False).head(1)

In [None]:
most_liked_photo

In [None]:
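# Caveat: max() is taken over all post types, so this is empty if the most-liked post is not a Photo
facebook.loc[(facebook['Type'] == "Photo") & (facebook['like'] == facebook['like'].max())]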

In [None]:
photos = facebook.loc[facebook['Type'] == "Photo"]

In [None]:
photos.loc[photos['like'].idxmax()]
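The approaches above agree only when the most-liked post overall happens to be a Photo. A sketch with made-up data where that is not the case shows the difference between masking against the overall maximum and filtering first:

```python
import pandas as pd

# Made-up data where the most-liked post overall is a Status, not a Photo
toy = pd.DataFrame({
    'Type': ['Photo', 'Status', 'Photo'],
    'like': [40, 90, 75],
})

# Comparing against the maximum over all posts returns no rows here
overall_mask = (toy['Type'] == 'Photo') & (toy['like'] == toy['like'].max())
print(len(toy.loc[overall_mask]))  # 0

# Filtering first, then taking idxmax, finds the most-liked Photo
toy_photos = toy.loc[toy['Type'] == 'Photo']
best_photo = toy_photos.loc[toy_photos['like'].idxmax()]
print(best_photo['like'])  # 75
```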

## Task 9: Find the average

What is the mean number of Total Interactions for **Photos**?

In [None]:
call_on_students(1)

In [None]:
# Your code here
facebook.loc[facebook['Type'] == "Photo", 'Total Interactions'].mean()

In [None]:
photos['Total Interactions'].describe()['mean']

<a id='viz'></a>
# Part 2: Visualization

In [None]:
# Need more imports!
import matplotlib.pyplot as plt

## Task 10: Bar Chart

Create a bar chart showing the number of posts per month.

Order the x-axis by month as they appear on the calendar.

Don't forget to add labels and a title.  

Use the `plt.subplots` method if you can, but if you can't, fall back on the `plt` syntax.

In [None]:
facebook.head()

In [None]:
call_on_students(1)

In [None]:
facebook['Post Month'].value_counts().sort_index().values

In [None]:
# Your code here
# First need to access the number of posts per month
x = facebook['Post Month'].value_counts().index
height = facebook['Post Month'].value_counts().values

In [None]:
height

In [None]:
x

In [None]:
# Your code here
# Now need to visualize
fig, ax = plt.subplots()
ax.bar(x=x, height=height)
ax.set_title('Number of Posts per Month')
ax.set_xlabel('Post Month')
ax.set_ylabel('Number of Posts')
# Label every month from 1 to 12 so the axis reads in calendar order
ax.set_xticks(range(1, 13))
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr',
                    'May', 'Jun', 'Jul', 'Aug',
                    'Sep', 'Oct', 'Nov', 'Dec'],
                   rotation=45);
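`value_counts` orders its result by count, and it leaves out any month that has no posts. A small sketch of how the counts could be put in calendar order with the gaps filled by zero, using made-up month numbers rather than the real column:

```python
import pandas as pd

# Made-up month numbers; months 1, 4-6 and 8-11 have no posts in this toy series
months = pd.Series([12, 3, 3, 7, 12, 12, 2], name='Post Month')

# sort_index puts the counts in calendar order instead of count order
counts = months.value_counts().sort_index()

# reindex adds the missing months with a count of 0, giving one value per month
all_months = counts.reindex(range(1, 13), fill_value=0)
print(all_months.tolist())  # [0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 3]
```

Passing `all_months.index` and `all_months.values` to `ax.bar` would then give twelve bars in calendar order.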

Let's discuss: what do you notice?

- fewer posts at the beginning of the year
- spikes around vacation periods


## Task 11: Scatter Plot

Create a scatter plot that shows the correlation between total interactions and likes.

In [24]:
call_on_students(1)


In [None]:
# Your code here
fig, ax = plt.subplots()
ax.scatter(x=facebook['Total Interactions'], y=facebook['like'])
ax.set_title('Likes vs. Total Interactions')
ax.set_xlabel('Total Interactions')
ax.set_ylabel('Likes');

In [None]:
facebook.plot.scatter('Total Interactions', 'like');

Let's discuss: what do you notice?

- Outlier!!! who dat
- very strong positive correlation
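One way to put a number on "very strong positive correlation" is the Pearson coefficient from `Series.corr`. The sketch below uses made-up values, not the Facebook data; on the real frame the equivalent call would be `facebook['Total Interactions'].corr(facebook['like'])`.

```python
import pandas as pd

# Made-up values where likes are always a fixed fraction of interactions
toy = pd.DataFrame({'Total Interactions': [10, 20, 30, 40]})
toy['like'] = toy['Total Interactions'] * 0.8

# Pearson correlation: values near 1 mean a strong positive linear relationship
r = toy['Total Interactions'].corr(toy['like'])
print(round(r, 3))  # 1.0
```

If, as the column name suggests, likes are counted as part of total interactions, a high coefficient is partly built into the data rather than a surprising finding.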


<a id='datastructures'></a>
# Part 3: Data Structures

For this next section, we will explore a nested dictionary that comes from the Spotify API.  You won't need to do these kinds of imports and opens on the code challenge, so I'll go ahead and share the code to get started here.

The `data` variable below contains 6 separate pings, each of which returns a list of the top 20 songs streamed on a given day.


In [None]:
# First we need some imports
import json
import pickle

In [None]:
# Let's open the pickled file
with open('data/offset_newreleases.p','rb') as read_file:
    responses = pickle.load(read_file)

In [None]:
data = [json.loads(r) for r in responses]

In [None]:
# Sanity check - we said there would be 6 pings, are there?
len(data)
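The two-step load works because pickle restores the Python object that was saved, and `json.loads` then parses each element into a dictionary. A self-contained sketch, assuming the pickled object is a list of JSON strings (the real file's exact contents are not shown here, and `fake_responses` is a made-up stand-in):

```python
import json
import pickle

# Made-up stand-in for the pickled file: a list of JSON strings
fake_responses = [json.dumps({'albums': {'items': []}}) for _ in range(3)]

# pickle.dumps / pickle.loads round-trip the Python list through bytes
blob = pickle.dumps(fake_responses)
restored = pickle.loads(blob)

# Each element is still a string until json.loads parses it into a dict
parsed = [json.loads(r) for r in restored]
print(type(restored[0]).__name__, type(parsed[0]).__name__)  # str dict
```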

We will work only with the first response.

In [None]:
first_response = data[0]

In [None]:
first_response

## Task 12: Navigate the dictionary

Explore the `first_response` dictionary and find how to access the items list, which contains the details about the twenty songs. Assign the list to the variable `first_twenty_songs`.

Hint: print out the keys at each level with .keys().

In [None]:
call_on_students(1)

In [None]:
first_response.keys()

In [None]:
first_response['albums'].keys()

In [None]:
first_response['albums']['items']

In [None]:
len(first_response['albums']['items'])

In [None]:
# Your code here
first_twenty_songs = first_response['albums']['items']
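The same level-by-level exploration can be practiced on a small made-up response. The `toy_response` below only mimics the `'albums'` and `'items'` nesting used above; its other keys and values are placeholders:

```python
# Made-up response with the same 'albums' -> 'items' nesting used above
toy_response = {
    'albums': {
        'href': 'placeholder',
        'items': [
            {'name': 'Song A', 'artists': [{'name': 'Artist 1'}]},
            {'name': 'Song B', 'artists': [{'name': 'Artist 1'}, {'name': 'Artist 2'}]},
        ],
    }
}

# Walk down one level at a time, printing the keys available before going deeper
level = toy_response
for key in ['albums', 'items']:
    print(type(level).__name__, list(level.keys()))
    level = level[key]

# The final level is a list, so it is indexed by position rather than by key
toy_songs = level
print(type(toy_songs).__name__, len(toy_songs))  # list 2
```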

## Task 13: Loop to List

Create a list of **track names** of all twenty songs using a for loop or list comprehension.

In [None]:
call_on_students(1)

In [None]:
first_twenty_songs[0]

In [None]:
first_twenty_songs[0]['name']

In [None]:
# Your code here
track_names = [record['name'] for record in first_twenty_songs]
track_names

In [None]:
track_name_2 = []
for record in first_twenty_songs:
    track_name_2.append(record['name'])
track_name_2

In [None]:
col = first_response['albums']['items']
df = pd.DataFrame(col)
df_list = list(df['name'])
df_list

## Task 14: Create a new dictionary

Create a dictionary called `song_dictionary` which consists of each track name `string` as a key and a `tuple` of artists associated with each track as a value.

In [None]:
first_twenty_songs[0].keys()

In [None]:
first_twenty_songs[0]['artists']

In [None]:
call_on_students(1)

In [None]:
artist_list = []
for song in first_twenty_songs:
    song_artists = []
    for artist in song['artists']:
        song_artists.append(artist['name'])
    artist_list.append(tuple(song_artists))
        

In [None]:
song_dictionary2 = {song['name']: tuple([artist['name'] for artist in song['artists']]) for song in first_twenty_songs}   

In [None]:
artist_list

In [None]:
song_dict = dict(zip(track_names, artist_list))
song_dict

In [None]:
first_twenty_songs[0]['artists']

In [25]:
# Your code here:
song_dictionary = {}
for song in first_twenty_songs:
    artist_list = []
    for artist in song['artists']:
        artist_list.append(artist['name'])
    song_dictionary[song['name']] = tuple(artist_list)


In [None]:
song_dictionary

## Task 15: Write a function

Create a function which takes an **artist name** and the **song_dictionary** as arguments, and returns a `list` of songs written by that artist.

In [None]:
# call_on_students(1)

In [None]:
song_dictionary['Over Now (with The Weeknd)']

In [None]:
# Easier to do things outside of a function first
# Let's try for The Weeknd
song_list = []

for song in song_dictionary:
    if "The Weeknd" in song_dictionary[song]:
        song_list.append(song)
song_list

In [None]:
# Your code here

def find_song_by_artist(artist_name, song_dict):
    
    '''
    Parameters:
    artist_name: a string of an artist's name to be used to search the dictionary
    song_dict:  a dictionary of top-twenty songs with song names as keys and a tuple of 
    artist names as values
    
    Returns:
    A list of songs which the given artist appeared on
    '''
    song_list = []
    for song in song_dict:
        if artist_name in song_dict[song]:
            song_list.append(song)
    return song_list
    

In [None]:
# Test the function:
find_song_by_artist('Big Sean', song_dictionary)

In [None]:
def find_song_by_artist2(artist_name, song_dict):
    return list({k: v for k, v in song_dict.items() if artist_name in v}.keys())

In [None]:
find_song_by_artist2('Big Sean', song_dictionary)
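Because each value is a tuple, `artist_name in v` tests for an exact element match, not a substring match. A sketch on a made-up dictionary of the same shape, where `songs_for` is a hypothetical name for a list-comprehension version of the function:

```python
# Made-up dictionary in the same shape as song_dictionary: title -> tuple of artists
toy_songs = {
    'Song A': ('Artist 1',),
    'Song B': ('Artist 1', 'Artist 2'),
    'Song C': ('Artist 22',),
}

def songs_for(artist_name, song_dict):
    """Return the titles whose artist tuple contains artist_name exactly."""
    return [title for title, artists in song_dict.items() if artist_name in artists]

print(songs_for('Artist 1', toy_songs))  # ['Song A', 'Song B']
print(songs_for('Artist 2', toy_songs))  # ['Song B']
```

'Song C' is not returned for 'Artist 2' even though 'Artist 22' contains that text. If the values were plain strings instead of tuples, `in` would perform a substring check and 'Song C' would match.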

In [None]:
# Outside the function, artist_name is not defined, so use the name directly
list({k: v for k, v in song_dict.items() if 'Big Sean' in v}.keys())