# Exercise 1: Introduction to Pandas

In this exercise, we will work with the pandas library, which is one of the most important Python packages for data analysis and manipulation.  
The documentation of this package can be found here: https://pandas.pydata.org/docs/

In [1]:
import pandas as pd

## Examples: The Iris Dataset

We read in the Iris dataset that we obtained from https://archive.ics.uci.edu/ml/datasets/Iris and walk through a few basic examples.

In [2]:
# read data from a file into a data frame, specify column names by hand
df = pd.read_csv("iris.data", names = ["sepal_length", "sepal_width"," petal_weight", "petal_width", "class"])
df

,sepal_length,sepal_width,petal_weight,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
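
The column list passed to `read_csv` above contains the name `" petal_weight"`, which carries a leading space and, going by the attribute list of the UCI Iris dataset (sepal length, sepal width, petal length, petal width, class), most likely should read `petal_length`. A minimal sketch of how such names could be cleaned up after loading; `iris_demo` is a hypothetical one-row stand-in, not the real data:

```python
import pandas as pd

# Hypothetical one-row stand-in for the Iris frame (values are illustrative only).
iris_demo = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, "Iris-setosa"]],
    columns=["sepal_length", "sepal_width", " petal_weight", "petal_width", "class"],
)

# Strip stray whitespace from every column name.
iris_demo.columns = iris_demo.columns.str.strip()

# Rename the mislabeled column; rename() returns a new frame.
iris_demo = iris_demo.rename(columns={"petal_weight": "petal_length"})

print(list(iris_demo.columns))
```

Clean names also make attribute access (`df.petal_length`) possible, which does not work for a name that starts with a space.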


#### Accessing rows and columns

In [3]:
# columns can be accessed as attributes
df.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [4]:
# rows and columns can be accessed by index as well
# -> use loc to access columns by name
print(df.loc[:,"sepal_length"])

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64


In [5]:
# -> use iloc to access columns by numerical index
print(df.iloc[4:,1])

4      3.6
5      3.9
6      3.4
7      3.4
8      2.9
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 146, dtype: float64


In [6]:
#Training

In [7]:
print(df)

     sepal_length  sepal_width   petal_weight  petal_width           class
0             5.1          3.5            1.4          0.2     Iris-setosa
1             4.9          3.0            1.4          0.2     Iris-setosa
2             4.7          3.2            1.3          0.2     Iris-setosa
3             4.6          3.1            1.5          0.2     Iris-setosa
4             5.0          3.6            1.4          0.2     Iris-setosa
..            ...          ...            ...          ...             ...
145           6.7          3.0            5.2          2.3  Iris-virginica
146           6.3          2.5            5.0          1.9  Iris-virginica
147           6.5          3.0            5.2          2.0  Iris-virginica
148           6.2          3.4            5.4          2.3  Iris-virginica
149           5.9          3.0            5.1          1.8  Iris-virginica

[150 rows x 5 columns]


In [8]:
print(df.iloc[5:8,2])

5    1.7
6    1.4
7    1.5
Name:  petal_weight, dtype: float64


In [9]:
print(df.loc[5:8, "class"])

5    Iris-setosa
6    Iris-setosa
7    Iris-setosa
8    Iris-setosa
Name: class, dtype: object
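
The two slices above return a different number of rows even though both are written as `5:8`: `iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A small sketch on a made-up frame (`demo` is not part of the exercise data):

```python
import pandas as pd

# Made-up frame with the default integer index 0..9.
demo = pd.DataFrame({"value": list("abcdefghij")})

by_position = demo.iloc[5:8, 0]     # positions 5, 6, 7 (end excluded)
by_label = demo.loc[5:8, "value"]   # labels 5, 6, 7, 8 (end included)

print(len(by_position), len(by_label))
```

With the default index, positions and labels coincide, so the difference only shows up at the end of the slice.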


In [10]:
df.loc[1]

sepal_length             4.9
sepal_width              3.0
 petal_weight            1.4
petal_width              0.2
class            Iris-setosa
Name: 1, dtype: object

#### Advanced selection and built-in functions

In [11]:
# get all rows where sepal length is greater than 5

dg_filtered = df.loc[df["sepal_length"]>5]
dg_filtered
#df["sepal_length"]>5
#dg_filtered.loc[:5, "sepal_length"]

,sepal_length,sepal_width,petal_weight,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
10,5.4,3.7,1.5,0.2,Iris-setosa
14,5.8,4.0,1.2,0.2,Iris-setosa
15,5.7,4.4,1.5,0.4,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [12]:
df["sepal_length"]>5

0       True
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: sepal_length, Length: 150, dtype: bool
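
The comparison above yields a boolean Series (a mask) that `loc` uses to pick rows. Masks can be combined with `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses when written inline. A short sketch with made-up values:

```python
import pandas as pd

# Made-up sample in the shape of the Iris frame.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-versicolor"],
})

long_sepal = demo["sepal_length"] > 5
is_setosa = demo["class"] == "Iris-setosa"

both = demo.loc[long_sepal & is_setosa]     # rows where both conditions hold
either = demo.loc[long_sepal | is_setosa]   # rows where at least one holds
not_setosa = demo.loc[~is_setosa]           # rows where the condition is false

print(len(both), len(either), len(not_setosa))
```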

In [13]:
# get mean value of sepal length
print(df.sepal_length.mean())

5.843333333333335


In [14]:
# get unique class values
print(df["class"].unique())

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']


## Task 1: Exploring Census Data

In this task we work with the adult dataset, which has been extracted from a 1994 census dataset.  
A brief documentation can be found here: https://archive.ics.uci.edu/ml/datasets/adult

__a)__ Read in the "adult.csv" file and print its ```head()``` to get a quick overview of it. Note that this dataset contains NAs which are encoded as '?' and should be converted accordingly. How many rows and columns does this dataset have?

In [15]:
df = pd.read_csv("adult.csv", na_values="?")
df

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39.0,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64.0,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38.0,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44.0,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K
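
The task also asks for `head()` and the number of rows and columns. A minimal sketch of those calls on a tiny inline CSV; the three rows below are invented, and `io.StringIO` only stands in for the real file:

```python
import io
import pandas as pd

# Invented miniature CSV; '?' marks missing values as in the task description.
csv_text = """age,workclass,hours-per-week
39,State-gov,40
50,?,13
?,Private,40
"""

census_demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(census_demo.head())             # first rows for a quick overview
n_rows, n_cols = census_demo.shape    # shape is a (rows, columns) tuple
missing_per_column = census_demo.isna().sum()
print(n_rows, n_cols)
print(missing_per_column)
```

Checking `isna().sum()` right after loading is a quick way to confirm that the `'?'` markers were actually converted.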


In [16]:
df.loc[:,"hours-per-week"]

0        40
1        13
2        40
3        40
4        40
         ..
48837    36
48838    40
48839    50
48840    40
48841    60
Name: hours-per-week, Length: 48842, dtype: int64

__b)__ Compute the mean 'working time per week'!

In [17]:
mean_working_time = df.loc[:,"hours-per-week"].mean()
mean_working_time

40.422382375824085

__c)__ Give the unique values that occur in the attribute 'education'. Further, give the number of people in the dataset that have obtained each specific education level!

In [18]:
unique_values = df.loc[:,"education"].unique()
print(unique_values)

#df_edu = df.loc[:,"education"]
#print(df_edu)
#df_edu.iloc[2]

#print(df.loc[df["education"]=="Bachelors", "education"].count())


for item in unique_values:
    number = df.loc[df["education"]==item, "education"].count()
    print(item + " :" + str(number))


['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th' 'HS-jupytgrad']
Bachelors :8025
HS-grad :15783
11th :1812
Masters :2657
9th :756
Some-college :10878
Assoc-acdm :1601
Assoc-voc :2061
7th-8th :955
Doctorate :594
Prof-school :834
5th-6th :509
10th :1389
1st-4th :247
Preschool :83
12th :657
HS-jupytgrad :1
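
The loop above counts one education level at a time. As an alternative sketch, `Series.value_counts()` produces all counts in one call, sorted from most to least frequent; the Series below is made up for illustration:

```python
import pandas as pd

# Made-up education column.
education = pd.Series(
    ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"]
)

counts = education.value_counts()   # index: unique values, values: occurrences
print(counts)
```

By default `value_counts()` skips missing values, which matches the behavior of `count()` in the loop.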


__c)__ List all persons with a Bachelor degree as their highest degree, sorted by their ```capital-loss``` in descending order. What is the sum of ```capital-loss``` for these persons?

In [19]:
df.loc[df["education"]=="Bachelors"].sort_values(by=['capital-loss'],ascending=False)

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
41864,52.0,Private,106176,Bachelors,13,Divorced,Adm-clerical,Unmarried,White,Male,0,3770,40,United-States,<=50K
33743,59.0,Private,157749,Bachelors,13,Widowed,Exec-managerial,Unmarried,White,Male,0,3004,40,United-States,>50K
29790,37.0,Private,188774,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,2824,40,United-States,>50K
26797,34.0,Private,203034,Bachelors,13,Separated,Sales,Not-in-family,White,Male,0,2824,50,United-States,>50K
45635,51.0,Self-emp-inc,200046,Bachelors,13,Separated,Sales,Unmarried,White,Male,0,2824,40,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16698,38.0,State-gov,143517,Bachelors,13,Never-married,Exec-managerial,Own-child,White,Male,0,0,40,United-States,<=50K
16689,31.0,Local-gov,47276,Bachelors,13,Married-civ-spouse,Other-service,Husband,White,Male,0,0,38,United-States,>50K
16682,78.0,Self-emp-inc,385242,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,9386,0,45,United-States,>50K
16663,35.0,Private,140564,Bachelors,13,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,>50K
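
The cell above lists the sorted rows but does not compute the sum the task asks for. One way this could be done, sketched on made-up numbers rather than the census data:

```python
import pandas as pd

# Made-up sample with the two relevant columns.
demo = pd.DataFrame({
    "education": ["Bachelors", "Masters", "Bachelors", "HS-grad"],
    "capital-loss": [0, 1902, 2824, 0],
})

bachelors = demo.loc[demo["education"] == "Bachelors"]
ranked = bachelors.sort_values(by="capital-loss", ascending=False)

# Sum the column over the filtered rows only.
total_loss = ranked["capital-loss"].sum()
print(total_loss)
```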


__d)__ How many males have a bachelor degree as their highest degree?

In [20]:
df_male = df.loc[(df["sex"]=="Male") & (df["education"]=="Bachelors")]
len(df_male)


5548

__e)__ List the 10 youngest persons with a bachelor degree or higher. _Hint: consider the_ ```education-num``` _attribute_.

In [21]:
df_young_and_educated = df.loc[(df["education-num"]>=13)].sort_values(by=['age'],ascending=True).head(10)
len(df_young_and_educated)
df_young_and_educated

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
12183,18.0,Local-gov,155905,Masters,14,Never-married,Prof-specialty,Own-child,White,Female,0,0,60,United-States,<=50K
3591,19.0,Private,100999,Bachelors,13,Never-married,Prof-specialty,Own-child,White,Female,0,0,30,United-States,<=50K
1570,19.0,,62534,Bachelors,13,Never-married,,Own-child,Black,Female,0,0,40,Jamaica,<=50K
31052,20.0,Private,190227,Masters,14,Never-married,Exec-managerial,Own-child,White,Male,0,0,25,United-States,<=50K
8415,20.0,Private,216436,Bachelors,13,Never-married,Sales,Other-relative,Black,Female,0,0,30,United-States,<=50K
8923,21.0,Private,182823,Bachelors,13,Never-married,Sales,Own-child,White,Male,0,0,30,United-States,<=50K
14436,21.0,Private,162667,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,0,0,40,Columbia,<=50K
37321,21.0,Private,224632,Bachelors,13,Never-married,Adm-clerical,Own-child,Black,Female,0,0,38,United-States,<=50K
34144,21.0,Private,238899,Bachelors,13,Never-married,Sales,Own-child,Black,Female,0,0,30,United-States,<=50K
3579,21.0,,180303,Bachelors,13,Never-married,,Not-in-family,Asian-Pac-Islander,Male,0,0,25,,<=50K


__f)__ Show for each combination of sex and race how many instances (people) are contained in the dataset.  _Hint: consider pandas'_ ```groupby()``` _function in this as well as in the following subtasks_.

In [113]:
df.groupby(by=["sex", "race"]).size().unstack()

sex,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,Other,White
Female,185,517,2308,155,13027
Male,285,1002,2377,251,28735
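
The `groupby(...).size().unstack()` chain builds a two-way count table. As a hedged alternative, `pd.crosstab` gives the same kind of table directly; the frame below is made up for illustration:

```python
import pandas as pd

# Made-up sample with the two grouping columns.
demo = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "race": ["White", "White", "Black", "Black", "White"],
})

via_groupby = demo.groupby(["sex", "race"]).size().unstack()
via_crosstab = pd.crosstab(demo["sex"], demo["race"])

print(via_crosstab)
```

One difference worth knowing: `crosstab` fills combinations that never occur with 0, while `unstack()` leaves them as `NaN` unless a fill value is given.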


__g)__ What is the mean age of men and women in this dataset?

In [114]:
(df.loc[:, ["sex", "age"]]).groupby(by="sex").mean()

sex,age
Female,36.927989
Male,39.49441


__h)__ Show for each combination of marital-status and race how many males/females over 40 years have a bachelor degree as their highest degree.

In [28]:
df.loc[(df["age"]>40) & (df["education"]=="Bachelors")].groupby(by=["marital-status", "race", "sex"]).size().unstack().fillna(0)

marital-status,race,Female,Male
Divorced,Amer-Indian-Eskimo,0.0,2.0
Divorced,Asian-Pac-Islander,7.0,4.0
Divorced,Black,34.0,20.0
Divorced,Other,2.0,2.0
Divorced,White,253.0,215.0
Married-civ-spouse,Amer-Indian-Eskimo,2.0,3.0
Married-civ-spouse,Asian-Pac-Islander,17.0,92.0
Married-civ-spouse,Black,16.0,70.0
Married-civ-spouse,Other,0.0,6.0
Married-civ-spouse,White,131.0,1852.0
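
The counts above appear as floats (`0.0`, `2.0`) because `unstack()` first inserts `NaN` for missing combinations and `fillna(0)` replaces them afterwards. A small sketch, on made-up rows, of passing `fill_value=0` to `unstack()` so the counts stay integers:

```python
import pandas as pd

# Made-up sample; the (Married-civ-spouse, Black, Female) combination is absent.
demo = pd.DataFrame({
    "marital-status": ["Divorced", "Divorced", "Married-civ-spouse"],
    "race": ["White", "White", "Black"],
    "sex": ["Female", "Male", "Male"],
})

counts = (
    demo.groupby(["marital-status", "race", "sex"])
    .size()
    .unstack(fill_value=0)   # missing combinations become 0 without a float detour
)
print(counts)
```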


## Task 2: Organizing a Book and Movie shop

For a virtual shop that sells movies and books, we have four tables:
* ```pd_customers```: Gives first and last name for each customer
* ```pd_books```: Gives the raw price for all books that are being sold
* ```pd_movies```: Gives the raw price for all movies that are being sold
* ```pd_transactions```: Gives the list of all transactions being made (which customer bought which item)

__a)__ Load all 4 datasets in separate dataframes!

In [222]:
df_pd_customers = pd.read_csv("pd_customers.csv", na_values="?")
df_pd_books = pd.read_csv("pd_books.csv", na_values="?")
df_pd_movies = pd.read_csv("pd_movies.csv", na_values="?")
df_pd_transactions = pd.read_csv("pd_transactions.csv", na_values="?")

In [223]:
df_pd_customers

,cust_id,first_name,last_name
0,1,Max,Mustemann
1,2,Ben,Mayer
2,3,Sarah,Mueller
3,4,Tina,Berger
4,5,Donald,T.
5,6,Miriam,Faber
6,7,Thomas,Hase
7,8,Fabian,Engelbert
8,9,Hans,Kleber
9,10,Brigitte,Jefferson


In [224]:
df_pd_books

,book,price
0,Book 1,9.99
1,Book 2,8.99
2,Book 3,29.99
3,Book 4,8.49
4,Book 5,15.99
5,Book 6,12.19
6,Book 7,13.99
7,Book 8,49.99
8,Book 9,125.99
9,Book 10,8.99


In [225]:
df_pd_movies

,movie,price
0,Movie 1,15.99
1,Movie 2,22.99
2,Movie 3,15.99
3,Movie 4,14.49
4,Movie 5,3.99
5,Movie 6,2.19
6,Movie 7,11.99
7,Movie 8,31.99
8,Movie 9,35.99
9,Movie 10,1.99


In [226]:
df_pd_transactions

,cust_id,item
0,1,Book 1
1,1,Movie 1
2,1,Movie 5
3,1,Book 9
4,2,Book 1
5,2,Book 8
6,2,Book 10
7,3,Movie 1
8,3,Movie 5
9,3,Movie 8


__b)__  Compile a listing of all items (i.e., books and movies) that have been sold, combined in a single dataframe.
The resulting dataframe should contain two columns: ```"item_name"``` and ```"price"```. _Hint: consider pandas'_ ```concat()``` _function_.

In [227]:
df_pd_books.rename(columns={"book":"item"}, inplace = True)
df_pd_movies.rename(columns={"movie":"item"}, inplace = True)

In [228]:
df_pd_books

,item,price
0,Book 1,9.99
1,Book 2,8.99
2,Book 3,29.99
3,Book 4,8.49
4,Book 5,15.99
5,Book 6,12.19
6,Book 7,13.99
7,Book 8,49.99
8,Book 9,125.99
9,Book 10,8.99


In [229]:
df_pd_movies

,item,price
0,Movie 1,15.99
1,Movie 2,22.99
2,Movie 3,15.99
3,Movie 4,14.49
4,Movie 5,3.99
5,Movie 6,2.19
6,Movie 7,11.99
7,Movie 8,31.99
8,Movie 9,35.99
9,Movie 10,1.99


In [230]:
df_trans_movies = pd.merge(df_pd_transactions, df_pd_movies, on='item', how='inner')
df_trans_books = pd.merge(df_pd_transactions, df_pd_books, on="item", how="inner" )

In [231]:
df_trans_books

,cust_id,item,price
0,1,Book 1,9.99
1,2,Book 1,9.99
2,4,Book 1,9.99
3,5,Book 1,9.99
4,1,Book 9,125.99
5,3,Book 9,125.99
6,4,Book 9,125.99
7,5,Book 9,125.99
8,5,Book 9,125.99
9,9,Book 9,125.99


In [232]:
df_trans_movies

,cust_id,item,price
0,1,Movie 1,15.99
1,3,Movie 1,15.99
2,4,Movie 1,15.99
3,5,Movie 1,15.99
4,9,Movie 1,15.99
5,1,Movie 5,3.99
6,3,Movie 5,3.99
7,4,Movie 5,3.99
8,5,Movie 5,3.99
9,9,Movie 5,3.99


In [233]:
df = pd.concat([df_trans_books, df_trans_movies], ignore_index=True)
df = df.loc[:, ["item", "price"]]
df.rename(columns={"item":"item_name"}, inplace = True)
df.drop_duplicates(['item_name','price'],keep= 'last')

,item_name,price
3,Book 1,9.99
9,Book 9,125.99
12,Book 8,49.99
18,Book 10,8.99
20,Book 7,13.99
21,Book 2,8.99
26,Movie 1,15.99
31,Movie 5,3.99
36,Movie 8,31.99
38,Movie 3,15.99


__c)__ Join the information on customer names, transactions, and prices into a single dataframe. _Hint: consider pandas'_ ```merge()``` _function_.

In [234]:
df_cust_trans = pd.merge(df_pd_customers, df_pd_transactions, on='cust_id', how='inner')

df_movies_and_books = pd.concat([df_pd_movies, df_pd_books])
df_movies_and_books

df_cust_trans_movies_books = pd.merge(df_cust_trans, df_movies_and_books, on="item", how="inner" )
#df_cust_trans_movies_books[["first_name", "last_name", "item", "price"]]
df_cust_trans_movies_books

,cust_id,first_name,last_name,item,price
0,1,Max,Mustemann,Book 1,9.99
1,2,Ben,Mayer,Book 1,9.99
2,4,Tina,Berger,Book 1,9.99
3,5,Donald,T.,Book 1,9.99
4,1,Max,Mustemann,Movie 1,15.99
5,3,Sarah,Mueller,Movie 1,15.99
6,4,Tina,Berger,Movie 1,15.99
7,5,Donald,T.,Movie 1,15.99
8,9,Hans,Kleber,Movie 1,15.99
9,1,Max,Mustemann,Movie 5,3.99


__d)__ Compute a table of customers. For all customers give the number of items bought, the total price of these items, and the average price of these items.

In [235]:
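# aggregate only the numeric price column; this avoids errors on the text column "item" in recent pandas versions
df_cust_trans_movies_books = df_cust_trans_movies_books.groupby(["cust_id", "first_name", "last_name"])[["price"]].agg(["count","sum","mean"])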
df_cust_trans_movies_books.reset_index(level=[1,2], inplace=True)
df_cust_trans_movies_books.columns = ['_'.join(col).strip() for col in df_cust_trans_movies_books.columns.values]
df_cust_trans_movies_books

cust_id,first_name_,last_name_,price_count,price_sum,price_mean
1,Max,Mustemann,4,155.96,38.99
2,Ben,Mayer,3,68.97,22.99
3,Sarah,Mueller,4,177.96,44.49
4,Tina,Berger,8,226.92,28.365
5,Donald,T.,9,381.91,42.434444
6,Miriam,Faber,2,45.98,22.99
8,Fabian,Engelbert,1,15.99,15.99
9,Hans,Kleber,7,245.93,35.132857
10,Brigitte,Jefferson,2,31.98,15.99
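
The column names above (`first_name_`, `price_count`, ...) result from joining the two header levels by hand. As an alternative sketch, named aggregation assigns the output names directly and avoids the two-level header; the data and the names `n_items`, `total_price`, and `mean_price` are made up for illustration:

```python
import pandas as pd

# Made-up purchases: customer 1 bought two items, customer 2 one item.
purchases = pd.DataFrame({
    "cust_id": [1, 1, 2],
    "price": [9.99, 15.99, 8.99],
})

# Each keyword is an output column: name=(input column, aggregation function).
summary = purchases.groupby("cust_id").agg(
    n_items=("price", "count"),
    total_price=("price", "sum"),
    mean_price=("price", "mean"),
).reset_index()

print(summary)
```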


In [236]:
#df1 = df_cust_trans_movies_books.groupby(by=["cust_id", "first_name", "last_name"]).sum()
#df2 = df_cust_trans_movies_books.groupby(by=["cust_id", "first_name", "last_name"]).mean()
#df3 = df_cust_trans_movies_books.groupby(by=["cust_id", "first_name", "last_name"]).count()


__e)__  Round the average price to two digits and export the resulting table to a csv-file!

In [267]:
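# round() returns a new dataframe, so assign the result before exporting
df_cust_trans_movies_books = df_cust_trans_movies_books.round({'price_mean': 2})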
df_cust_trans_movies_books.to_csv("df_cust_trans_movies_books.csv")
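
`DataFrame.round()` does not modify the frame it is called on; it returns a rounded copy, so the result has to be kept before exporting. A small sketch that writes to an in-memory buffer instead of a file; the table values are illustrative:

```python
import io
import pandas as pd

# Illustrative table with an unrounded mean column.
table = pd.DataFrame({"cust_id": [5, 9], "price_mean": [42.434444, 35.132857]})

rounded = table.round({"price_mean": 2})   # new frame; `table` stays unchanged

# to_csv accepts a path or any text buffer; index=False omits the row index.
buffer = io.StringIO()
rounded.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When the index carries information, as `cust_id` does in the table above, the default `index=True` is the better choice.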

__f)__ Compute lists of the top 10 bestselling items, both by count and by sum of prices.

In [268]:
# combine the two price lists and attach a price to each transaction
df_movies_and_books = pd.concat([df_pd_movies, df_pd_books], ignore_index = True)
df = pd.merge(df_pd_transactions, df_movies_and_books, on="item")

# top 10 items by total revenue
df.loc[:, ["item", "price"]].groupby(by="item").sum().sort_values(by="price", ascending=False).head(10)


item,price
Book 9,755.94
Movie 8,159.95
Book 8,149.97
Movie 1,79.95
Book 10,53.94
Book 1,39.96
Movie 3,31.98
Book 7,27.98
Movie 2,22.99
Movie 5,19.95


In [270]:
# top 10 items by number of sales
df2 = df.loc[:, ["item", "price"]].groupby(by="item").count().sort_values(by="price", ascending=False).head(10)
df2.rename(columns={"price" : "count"}, inplace=True)
df2

item,count
Book 10,6
Book 9,6
Movie 1,5
Movie 5,5
Movie 8,5
Book 1,4
Book 8,3
Book 7,2
Movie 3,2
Book 2,1
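
Both rankings can also be derived from a single aggregation. A hedged sketch using `nlargest()`, which sorts and truncates in one step; the sales rows are made up and the top 2 stand in for the top 10:

```python
import pandas as pd

# Made-up sales records, one row per sold item.
sales = pd.DataFrame({
    "item": ["Book 9", "Book 9", "Movie 1", "Movie 1", "Movie 1", "Book 2"],
    "price": [125.99, 125.99, 15.99, 15.99, 15.99, 8.99],
})

# One pass gives both the number of sales and the revenue per item.
per_item = sales.groupby("item")["price"].agg(["count", "sum"])

top_by_count = per_item.nlargest(2, "count")   # most frequently sold
top_by_revenue = per_item.nlargest(2, "sum")   # highest total price

print(top_by_count)
print(top_by_revenue)
```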
