# Data Cleaning Exercise

## Task: 

Have a look at the following dataset: dsm-beuth-edl-demodata-orig

Write a Python / pandas script that 'cleans' this dataset. Justify your actions in the notebook or Python script you provide as a solution (link, file, Kaggle repo, etc.).

The original dataset does not necessarily have to be recreated. A sound strategy and good arguments are more important. Value: 5 points.

## Solution

In [57]:
import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/edlich/eternalrepo/master/DS-WAHLFACH/dsm-beuth-edl-demodata-dirty.csv"
demodata = pd.read_csv(url)

### 1. Assess data (dimensions, types etc.)

In [58]:
demodata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          20 non-null     float64
 1   full_name   21 non-null     object 
 2   first_name  21 non-null     object 
 3   last_name   21 non-null     object 
 4   email       20 non-null     object 
 5   gender      20 non-null     object 
 6   age         21 non-null     object 
dtypes: float64(1), object(6)
memory usage: 1.4+ KB
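
Beyond `info()`, counting missing values, fully empty rows, duplicates and unparseable ages helps justify the later cleaning steps. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not the exercise dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the exercise dataset.
toy = pd.DataFrame({
    "id": [1.0, 2.0, np.nan, 4.0],
    "full_name": ["Ann Lee", "Bob Ray", np.nan, "Bob Ray"],
    "age": ["30", "old", np.nan, "old"],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Rows in which every field is missing.
all_nan_rows = int(toy.isna().all(axis=1).sum())

# Rows that repeat an earlier row when the id column is ignored.
duplicate_rows = int(toy.duplicated(subset=["full_name", "age"]).sum())

# Age entries that are present but cannot be parsed as numbers.
is_unparseable = toy["age"].notna() & pd.to_numeric(toy["age"], errors="coerce").isna()
non_numeric_age = toy.loc[is_unparseable, "age"]

print(missing_per_column)
print(all_nan_rows, duplicate_rows, non_numeric_age.tolist())
```

Running the same checks on `demodata` would point at the issues the following steps address: empty rows, repeated records, gaps in `gender` and `email`, and text in the `age` column.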


In [59]:
demodata.head(22)

Unnamed: 0,id,full_name,first_name,last_name,email,gender,age
0,1.0,Mariel Finnigan,Mariel,Finnigan,mfinnigan0@usda.gov,Female,60
1,2.0,Kenyon Possek,Kenyon,Possek,kpossek1@ucoz.com,Male,12
2,3.0,Lalo Manifould,Lalo,Manifould,lmanifould2@pbs.org,Male,26
3,4.0,Nickola Carous,Nickola,Carous,ncarous3@phoca.cz,Male,4
4,5.0,Norman Dubbin,Norman,Dubbin,ndubbin4@wikipedia.org,Male,17
5,6.0,Hasty Perdue,Hasty,Perdue,hperdue5@qq.com,,77
6,7.0,Franz Castello,Franz,Castello,fcastello6@1688.com,Male,25
7,8.0,Jorge Tarney,Jorge,Tarney,jtarney7@ft.com,Male,77
8,9.0,Eunice Blakebrough,Eunice,Blakebrough,eblakebrough8@sohu.com,Female,45
9,10.0,Kristopher Frankcombe,Kristopher,Frankcombe,kfrankcombe9@slate.com,Male,old


### 2. Drop all-NaN rows

In [60]:
demodata.dropna(axis=0, how='all', inplace=True)
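
The choice of `how='all'` matters here: it removes only rows without any data, whereas `how='any'` would also discard records that merely lack one field. A small sketch of the difference, using a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is completely empty, row 2 only lacks a gender.
toy = pd.DataFrame({
    "full_name": ["Ann Lee", np.nan, "Bob Ray"],
    "gender": ["Female", np.nan, np.nan],
})

# how="all" drops a row only if every field in it is missing.
only_empty_removed = toy.dropna(axis=0, how="all")

# how="any" drops a row as soon as a single field is missing.
any_gap_removed = toy.dropna(axis=0, how="any")

print(len(only_empty_removed), len(any_gap_removed))
```

Keeping partially filled rows preserves usable information such as name and age, and the remaining gaps can be handled column by column.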

### 3. Drop duplicate rows

In [61]:
demodata.drop_duplicates(subset=demodata.columns.difference(['id']), keep='first', inplace=True, ignore_index=True)
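
Excluding `id` from the comparison treats two rows as duplicates when they describe the same person, even if they were stored under different ids. A minimal sketch with a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 hold the same person under different ids.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "full_name": ["Ann Lee", "Bob Ray", "Ann Lee"],
    "age": ["30", "41", "30"],
})

# Compare on every column except id.
content_columns = toy.columns.difference(["id"])

# Keep the first occurrence and rebuild the row index as 0..n-1.
deduplicated = toy.drop_duplicates(subset=content_columns, keep="first", ignore_index=True)

print(deduplicated)
```

With the default `subset=None`, the differing ids would make the two rows look distinct and nothing would be dropped.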

### 4. Replace NaN fields in the gender and email columns with appropriate values

In [62]:
demodata.gender = demodata.gender.fillna("unknown")
demodata.email = demodata.email.fillna("none")
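
The same fill can be written as one call by passing a column-to-value mapping to `fillna`, which leaves all other columns untouched. A sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing email and one missing gender.
toy = pd.DataFrame({
    "email": ["ann@example.com", np.nan],
    "gender": [np.nan, "Male"],
    "age": ["30", "41"],
})

# Each key names a column, each value is the placeholder used for its gaps.
filled = toy.fillna({"gender": "unknown", "email": "none"})

print(filled)
```

Placeholder strings keep the rows usable for grouping and counting, at the cost of mixing a sentinel with real values; leaving the fields as NaN would be an equally defensible choice if later analysis handles missing data explicitly.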

### 5. Fix the id column (float to int and consecutive numbering)

In [63]:
demodata["id"] = np.arange(demodata.shape[0])
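
`np.arange(n)` numbers the rows from 0, while the ids in the raw data start at 1. If the original convention should be kept, the range can be shifted by one. A small sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing and a non-consecutive id.
toy = pd.DataFrame({
    "id": [1.0, np.nan, 7.0],
    "full_name": ["Ann Lee", "Bob Ray", "Cy Dole"],
})

n_rows = toy.shape[0]
zero_based = np.arange(n_rows)         # 0, 1, 2
one_based = np.arange(1, n_rows + 1)   # 1, 2, 3

# Overwriting the column also turns it from float into an integer dtype.
toy["id"] = one_based

print(toy)
```

Either numbering yields unique, consecutive integer ids; which one to use depends on whether downstream consumers expect ids to start at 1.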

### 6. Replace "old" with "99" and convert age column from object to positive integer

In [64]:
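# Assign the result back: replace(inplace=True) on a selected column may not update demodata in newer pandas versions.
demodata["age"] = demodata["age"].replace({"old": "99"})
# Convert the strings to integers, then remove negative signs.
demodata["age"] = demodata["age"].astype(int)
demodata["age"] = demodata["age"].abs()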
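
`astype(int)` raises an error as soon as the column contains another unexpected string or a missing value. A more defensive variant, sketched here on a hypothetical `toy` column, coerces unparseable entries to missing values and reports how many there were instead of failing:

```python
import pandas as pd

# Hypothetical toy data with a text label and a negative age.
toy = pd.DataFrame({"age": ["60", "old", "-78", "12"]})

# Map the one known text label to a number, as in the step above.
age_text = toy["age"].replace({"old": "99"})

# Turn anything that is still not a number into NaN instead of raising.
age_numeric = pd.to_numeric(age_text, errors="coerce")
unparsed = int(age_numeric.isna().sum())

# Remove negative signs and store as a nullable integer column.
toy["age"] = age_numeric.abs().astype("Int64")

print(toy["age"].tolist(), unparsed)
```

The nullable `Int64` dtype is an assumption made for this sketch; it allows integer ages and missing values to coexist in one column. Whether 99 is a sensible stand-in for "old" remains a judgment call that should be stated in the justification.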

### 7. Print result

In [65]:
demodata.head(22)
# no need to write csv
# demodata.to_csv("dsm-beuth-edl-demodata-cleaned.csv")

Unnamed: 0,id,full_name,first_name,last_name,email,gender,age
0,0,Mariel Finnigan,Mariel,Finnigan,mfinnigan0@usda.gov,Female,60
1,1,Kenyon Possek,Kenyon,Possek,kpossek1@ucoz.com,Male,12
2,2,Lalo Manifould,Lalo,Manifould,lmanifould2@pbs.org,Male,26
3,3,Nickola Carous,Nickola,Carous,ncarous3@phoca.cz,Male,4
4,4,Norman Dubbin,Norman,Dubbin,ndubbin4@wikipedia.org,Male,17
5,5,Hasty Perdue,Hasty,Perdue,hperdue5@qq.com,unknown,77
6,6,Franz Castello,Franz,Castello,fcastello6@1688.com,Male,25
7,7,Jorge Tarney,Jorge,Tarney,jtarney7@ft.com,Male,77
8,8,Eunice Blakebrough,Eunice,Blakebrough,eblakebrough8@sohu.com,Female,45
9,9,Kristopher Frankcombe,Kristopher,Frankcombe,kfrankcombe9@slate.com,Male,99
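
A few automated checks can back up the visual inspection of the result. The helper `check_cleaned` below is hypothetical and only sketches the kind of conditions the cleaned frame is expected to meet; the two sample rows are taken from the output above.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> list:
    """Return a list of problems found in a cleaned frame (empty if none)."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values remain")
    if frame.duplicated(subset=frame.columns.difference(["id"])).any():
        problems.append("duplicate rows remain")
    if not pd.api.types.is_integer_dtype(frame["age"]):
        problems.append("age is not an integer column")
    elif (frame["age"] < 0).any():
        problems.append("negative ages remain")
    if not frame["id"].is_unique:
        problems.append("ids are not unique")
    return problems

# Two rows copied from the cleaned output, reduced to three columns.
sample = pd.DataFrame({
    "id": [0, 1],
    "full_name": ["Mariel Finnigan", "Kenyon Possek"],
    "age": [60, 12],
})

# An empty list means that no problems were found.
print(check_cleaned(sample))
```

Calling `check_cleaned(demodata)` after step 6 would apply the same checks to the full dataset.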
