### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 4-2: Manipulating `pandas` DataFrames

## Overview

* Handling different data types
* Combining data from different tables
* Applying functions to DataFrames
* Creating basic plots using `pandas`

## Handling different data types
* Each column in a pandas DataFrame (or Series) has an assigned data type, its dtype (e.g. `int64`, `float64`, `datetime64[ns]`, `object`).
* The dtype is usually assigned automatically when the data is loaded, but you can change it.
* It is possible (though not usually advised) to mix different types of object in the same column. In this case the column dtype is `object`.
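
As a minimal sketch of how dtypes are assigned, the toy DataFrame below (hypothetical values, not the course data) has one integer column, one float column, and one column that mixes Python types:

```python
import pandas as pd

# Hypothetical toy data, not the course files
toy = pd.DataFrame({
    "rank": [1, 2, 3],                # whole numbers -> integer dtype
    "streams_m": [45.2, 38.9, 30.1],  # decimals -> float dtype
    "mixed": [1, "two", 3.0],         # mixed Python objects -> object dtype
})

print(toy.dtypes)

# Each element of the mixed column keeps its own Python type
mixed_types = toy["mixed"].map(type).tolist()
print(mixed_types)
```

Because an `object` column can hold anything, numeric operations on it are slower and can fail partway through, which is one reason mixing types is discouraged.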

### Reading dtypes

In [None]:
import pandas as pd

songsdf = pd.read_csv('data/songs.csv', index_col=0)
wikiviewsdf = pd.read_hdf('data/pageviews_2022.h5')

display(songsdf)
display(wikiviewsdf)

In [None]:
songsdf.info()
print(songsdf.dtypes)

wikiviewsdf.info()
print(wikiviewsdf.dtypes)

### Converting dtypes

In [None]:
# Convert the peak rank column from int to float
songsdf['peak_rank'] = songsdf['peak_rank'].astype(float)

In [None]:
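# The raw date column
display(wikiviewsdf['date'])
# Letting pandas guess the format may misread the values, so check the result
display(pd.to_datetime(wikiviewsdf['date']))

# Passing an explicit format (YYYYMMDD) makes the conversion unambiguous
wikiviewsdf['date'] = pd.to_datetime(wikiviewsdf['date'], format='%Y%m%d')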
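
How the `date` column is stored in the course file is not shown here. The sketch below assumes, purely for illustration, that dates arrive as integers in YYYYMMDD form, and shows why an explicit `format` matters in that case:

```python
import pandas as pd

# Hypothetical dates stored as integers in YYYYMMDD form
raw_dates = pd.Series([20220101, 20220102, 20220103])

# Without a format, integers are read as nanoseconds since 1970-01-01
naive = pd.to_datetime(raw_dates)
print(naive.dt.year.tolist())

# With an explicit format the digits are read as year, month, day.
# Casting to str first keeps the parsing step explicit.
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

The first conversion runs without error, which is what makes this kind of mistake easy to miss.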

In [None]:
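# Boolean mask: True wherever a value is missing
display(wikiviewsdf.isnull())

# Count missing values per column and show the 20 columns with the most
display(wikiviewsdf.isnull().sum().sort_values().tail(20))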

In [None]:
# Show only the non-missing values for one artist
display(wikiviewsdf['Stephen Sanchez'].dropna())

# Replace all missing values with 0
wikiviewsdf = wikiviewsdf.fillna(0)
display(wikiviewsdf)
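
The same steps on a small, made-up table (the column names `artist_a` and `artist_b` are placeholders) show what `isnull().sum()` and `fillna(0)` return, and that a column which once held `NaN` stays a float column after filling:

```python
import numpy as np
import pandas as pd

# Hypothetical page views; artist_b has no data for the first two days
views = pd.DataFrame({
    "artist_a": [100.0, 120.0, 90.0],
    "artist_b": [np.nan, np.nan, 15.0],
})

# Number of missing values in each column
missing_per_column = views.isnull().sum()
print(missing_per_column)

# Replace missing values with 0; the dtype is still float
filled = views.fillna(0)
print(filled)
print(filled.dtypes)
```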

### Operating on columns
* We can use many regular Python mathematical, string, date, etc. operations to manipulate data in columns
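
A minimal sketch of what column arithmetic does, using made-up numbers: operators act element by element on whole columns, so no explicit loop is needed.

```python
import pandas as pd

# Hypothetical daily views for two artists
views = pd.DataFrame({"artist_a": [10, 20, 30], "artist_b": [1, 2, 3]})

# Row-by-row sum of two columns
total = views["artist_a"] + views["artist_b"]

# Row-by-row mean, written by hand and with the built-in method
mean_by_hand = total / 2
mean_builtin = views[["artist_a", "artist_b"]].mean(axis=1)

print(total.tolist())
print(mean_by_hand.tolist())
```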

In [None]:
songsdf.loc[1]

In [None]:
# Sum the views of the artists who sang the top song

wikiviewsdf['Bizarrap'] + wikiviewsdf['Quevedo']

In [None]:
# Average views of the artists who sang the top song

(wikiviewsdf['Bizarrap'] + wikiviewsdf['Quevedo'])/2

In [None]:
# String operations
# Call string-specific methods with ".str"

display(songsdf['artist_names'] + '!!!')
display(songsdf['track_name'].str.upper())
display(songsdf['track_name'].str.replace(' ', '_'))
display(songsdf['track_name'].str.split())
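
The `.str` accessor also covers tests and counts, not only transformations. A short sketch with placeholder artist names (assuming, as in `songsdf`, that multiple artists are separated by `', '`):

```python
import pandas as pd

# Placeholder names, not taken from the course data
names = pd.Series(["Artist A, Artist B", "Artist C"])

# True where the entry lists more than one artist
is_collab = names.str.contains(", ", regex=False)

# Number of artists per entry: split into lists, then take each list's length
n_artists = names.str.split(", ").str.len()

print(is_collab.tolist())
print(n_artists.tolist())
```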

In [None]:
# Date operations
# Call date-specific methods with ".dt"

import datetime

display(wikiviewsdf['date'] - datetime.datetime(2022, 1, 1))

display(wikiviewsdf['date'].dt.dayofweek)
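
As a small illustration of the `.dt` accessor on made-up dates: subtracting a timestamp gives a timedelta column, and `dayofweek` counts from Monday = 0.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-03", "2022-02-14"]))

# Whole days elapsed since the start of the year
days_since = (dates - pd.Timestamp("2022-01-01")).dt.days

# Day of the week as a number (Monday = 0) and as a name
dow = dates.dt.dayofweek
dow_name = dates.dt.day_name()

print(days_since.tolist())
print(dow.tolist())
print(dow_name.tolist())
```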

In [None]:
# Let's set the date column as the index
wikiviewsdf = wikiviewsdf.set_index('date')
display(wikiviewsdf)

In [None]:
# use 'resample' with 'sum' to get pageviews every 2 Days
display(wikiviewsdf.resample('2D').sum())
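
To see what `resample` does, it can help to try it on a tiny table where the result is easy to check by hand. This sketch uses four days of invented values:

```python
import pandas as pd

# Four consecutive days of hypothetical page views
idx = pd.date_range("2022-01-01", periods=4, freq="D")
daily = pd.DataFrame({"artist_a": [1, 2, 3, 4]}, index=idx)

# Group the rows into 2-day bins and aggregate each bin
two_day_sum = daily.resample("2D").sum()
two_day_mean = daily.resample("2D").mean()

print(two_day_sum)
print(two_day_mean)
```

Each bin is labelled by its first day, and `resample` needs a datetime index (or a datetime column passed via `on=`), which is why the date column was set as the index first.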

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1:
# a) Convert the peak rank column in songsdf back to int type
# b) In the wikiviewsdf some columns are float type (an artefact of the NaNs we removed), convert these to int type


In [None]:
# Q2: Resample the wikiviewsdf to show the weekly page views to each artist


In [None]:
# Q3: Find the max and median page views for each artist.
# Return a Series with the relative difference between them.


## Combining data from different tables
* Often we would like to combine information from different sources into a single DataFrame.
* There are several ways to do this, for example stacking rows with `concat` or joining on shared columns with `merge`:

![append](figs/08_concat_row.svg "append")

In [None]:
wikiviewsdf_2021 = pd.read_hdf('data/pageviews_2021.h5')
wikiviewsdf_2022 = pd.read_hdf('data/pageviews_2022.h5')

In [None]:
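# Use append to append one df to another
# Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0.
# On newer versions this line raises an AttributeError; use pd.concat() (next cell) instead.

wikiviewsdf_2021.append(wikiviewsdf_2022)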

In [None]:
# Use concat to concatenate multiple DataFrames together

pd.concat([wikiviewsdf_2021, wikiviewsdf_2022])
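
Two details of `pd.concat` are easy to overlook: the original row labels are kept unless `ignore_index=True` is passed, and columns that exist in only one table are filled with `NaN`. A sketch with invented tables:

```python
import pandas as pd

# Hypothetical yearly tables; artist_b only appears in the second one
a = pd.DataFrame({"date": ["2021-12-30", "2021-12-31"], "artist_a": [5, 6]})
b = pd.DataFrame({"date": ["2022-01-01"], "artist_a": [7], "artist_b": [1]})

# Default: row labels are carried over, so they can repeat
stacked = pd.concat([a, b])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered)
```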

In [None]:
# Let's load some more Spotify data about the songs

songsdf_data = pd.read_csv('data/songs_data.csv', index_col=0)
display(songsdf_data)

### Merging dataframes

![merge](figs/08_merge_left.svg "merge")

In [None]:
# Merge DataFrames
# Merge is a very flexible function, see the documentation for the variations
# By default it performs an inner join; pass how='left' to keep every row of the left DataFrame

songsdf = songsdf.merge(songsdf_data, on=['artist_names', 'track_name'])
display(songsdf)
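
The difference between the default inner merge and a left merge can be sketched on two small invented tables (`songs` and `features` are placeholder names). The optional `indicator=True` argument adds a `_merge` column recording where each row came from:

```python
import pandas as pd

songs = pd.DataFrame({"track_name": ["Song A", "Song B", "Song C"],
                      "streams": [30, 20, 10]})
features = pd.DataFrame({"track_name": ["Song A", "Song B", "Song D"],
                         "tempo": [120, 95, 140]})

# Inner merge (default): only tracks present in both tables
inner = songs.merge(features, on="track_name")

# Left merge: every row of songs is kept, missing features become NaN
left = songs.merge(features, on="track_name", how="left", indicator=True)

print(inner)
print(left)
```

Comparing the row counts before and after a merge is a quick way to notice rows that were silently dropped.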

## 🏋️‍♀️ PRACTICE

In [None]:
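# Q4: Using a for loop with 'append()', combine the yearly page views 2017-2022 DataFrames into a single DataFrame.
# (On pandas >= 2.0, where 'append()' no longer exists, call 'pd.concat()' inside the loop instead.)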


In [None]:
# Q5: Repeat the above without a loop, but with a single 'concat()' operation.


In [None]:
# Q6:
# a) Sum each artist's total page views over the year 2022 to get a Series.
# b) Create a column in songsdf of 'lead_artist' (the first listed artist in artist_names)
# c) Combine the total page view Series with the songsdf based on the lead_artist column.


## Applying functions to DataFrames

### Frequently used methods
* This list is not exhaustive; many more useful operations are built in to `pandas`.

In [None]:
# Reset the index to 0,1,2,3... (the old index is kept as a column unless drop=True is passed)
display(wikiviewsdf.reset_index())

# Set an existing column as the index
display(songsdf.set_index('track_name'))

In [None]:
# Drop duplicate entries
display(songsdf.drop_duplicates(subset=['artist_names']))

In [None]:
# Transpose the DataFrame (flip rows/columns)
display(songsdf.T)

In [None]:
# Replace values according to a dictionary
mapdict = {"Rimas Entertainment LLC":"Rimas Entertainment"}
display(songsdf.replace(mapdict))
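
One point worth checking when using `replace` with a dictionary is that, by default, it matches whole cell values rather than substrings. A sketch on an invented table, combined with `drop_duplicates`:

```python
import pandas as pd

# Hypothetical tracks and record labels
df = pd.DataFrame({"track": ["S1", "S2", "S3"],
                   "label": ["Label LLC", "Other", "Label LLC"]})

# Whole-value match: every "Label LLC" cell becomes "Label"
cleaned = df.replace({"Label LLC": "Label"})

# A key that only matches part of a cell changes nothing
unchanged = df.replace({"LLC": ""})

# Keep the first row for each distinct label
first_per_label = cleaned.drop_duplicates(subset=["label"])

print(cleaned)
print(first_per_label)
```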

### .apply()
* What if the operation we want to do is not built-in to `pandas`?
* Use `.apply()` to apply any function across rows or columns!
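
The `axis` argument decides what the function receives: with `axis=1` it is called once per row, with the default `axis=0` once per column. The sketch below uses invented data and also shows a vectorised alternative, which is usually faster when one exists:

```python
import pandas as pd

# Hypothetical songs; artists are separated by ', '
df = pd.DataFrame({"artist_names": ["A, B", "C"], "streams": [100, 60]})

# axis=1: the lambda receives one row at a time
per_artist_apply = df.apply(
    lambda row: row["streams"] / len(row["artist_names"].split(", ")), axis=1
)

# Same result using built-in column operations
per_artist_vec = df["streams"] / df["artist_names"].str.split(", ").str.len()

# axis=0 (default): the lambda receives one column at a time
stream_range = df[["streams"]].apply(lambda col: col.max() - col.min())

print(per_artist_apply.tolist())
print(stream_range)
```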

In [None]:
# Use a lambda function for relatively simple operations

# Applied to a DataFrame (note axis=1)
display(songsdf.apply(lambda x: x['streams'] / len(x['artist_names'].split(', ')), axis = 1))

# The function is equivalent to:
def somefunction1(x):
    return x['streams'] / len(x['artist_names'].split(', '))

# Applied to a Series
display(songsdf['artist_names'].apply(lambda x: ''.join([y for y in x if y.lower() not in 'aeiou'])))

# The function is equivalent to:
def somefunction2(x):
    return ''.join([y for y in x if y.lower() not in 'aeiou'])

In [None]:
# Separately define a function for more complex procedures.

# Repeating the above:

def streams_per_artist(x):
    return x['streams'] / len(x['artist_names'].split(', '))

display(songsdf.apply(streams_per_artist, axis=1))

# or something more complicated:

def fake_fan(x):
    lead_artist = x['artist_names'].split(', ')[0]
    if x['weeks_on_chart'] < 3:
        return "I love the super new song, %s by %s." %(x['track_name'], lead_artist)
    elif x['weeks_on_chart'] < 52:
        return "%s by %s is my favourite song of the year." %(x['track_name'], lead_artist)
    else:
        return "%s is an all-time classic song. %s is a legend." %(x['track_name'], lead_artist)

display(songsdf.apply(fake_fan, axis=1))

## 🏋️‍♀️ PRACTICE

In [None]:
# Q7: Use a lambda function applied to songsdf to create a column with the phrase:
# "__song name__ by __lead artist name__"



In [None]:
# Q8: Write a function that takes rank, song title, artist names, and peak rank to output the phrase:
#
# "At number __rank__, down from number __peak rank__ [/at its peak], is __song title__
# by __artist name1__ [and _artist name2_...]."
#
# Apply this to the DataFrame to create a new column.


## Creating basic plots using `pandas`
* We can quickly plot data in DataFrames with the built-in plotting methods.
* `pandas` integrates with the standard `matplotlib` library (which we will spend more time on tomorrow).

![plot](figs/04_plot_overview.svg "plot")

In [None]:
import matplotlib.pyplot as plt

# Histogram
songsdf['streams'].plot.hist()
plt.show()

# We can customise elements of the plot
songsdf['streams'].plot.hist(title='Streams Histogram', color='r', bins=20)
plt.show()

In [None]:
# Line plot
wikiviewsdf.iloc[:,:5].plot(title='Artist Wikipedia Page Views', ylabel='Page Views', logy=True)
plt.show()

# Scatter plot
songsdf.plot.scatter(x='streams', y='peak_rank', title='Scatter Plot', logx=True, s=50, alpha=0.5)
plt.show()
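
The pandas plotting methods return a `matplotlib` Axes object, so a plot can be adjusted after it is created. A minimal sketch with invented page views (the column names are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five days of hypothetical page views for two artists
idx = pd.date_range("2022-01-01", periods=5, freq="D")
views = pd.DataFrame({"artist_a": [10, 12, 9, 15, 11],
                      "artist_b": [100, 90, 120, 110, 95]}, index=idx)

# Keep the returned Axes so labels and legend can be set afterwards
ax = views.plot(title="Daily page views (toy data)", logy=True)
ax.set_xlabel("Date")
ax.set_ylabel("Page views")
ax.legend(title="Artist")
plt.show()
```

Setting labels on the Axes works the same way regardless of which pandas plot type produced it.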

## 🏋️‍♀️ PRACTICE

In [None]:
# Q9: Plot the top 5 most popular artists' Wikipedia page views over the year 2022 on a single figure
# Ensure the plot is appropriately labelled and make any further customisations as you wish


In [None]:
# Q10: Create a scatter plot of total artist streams vs Wikipedia page views (requires a correct answer to Q6)
# Ensure the plot is appropriately labelled and make any further customisations as you wish
