
# Data Cleaning and Preparation

During the course of doing data analysis and modeling, a *significant amount of time* is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up *80% or more of an analyst’s time*.



In [None]:
import pandas as pd
import numpy as np
from numpy import nan as NA # represent NaN as NA

## A. Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals
of pandas is to make working with missing data as painless as possible. For example,
all of the descriptive statistics on pandas objects exclude missing data by default.

The way that missing data is represented in pandas objects is somewhat imperfect,
but it is functional for a lot of users. For numeric data, pandas uses the floating-point
value NaN (Not a Number) to represent missing data.
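As a minimal sketch of what "exclude missing data by default" means in practice, the snippet below compares a mean computed with and without skipping NaN. The Series `values` is a made-up example, not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry
values = pd.Series([1.0, np.nan, 3.0])

# By default NaN is skipped, so the mean is taken over 1.0 and 3.0 only
mean_skipping_na = values.mean()

# With skipna=False, any NaN propagates into the result
mean_with_na = values.mean(skipna=False)

# count() reports only the non-missing entries
non_missing = values.count()

print(mean_skipping_na, mean_with_na, non_missing)
```

Keeping `skipna=True` is usually convenient, but it is worth remembering that the statistic is then computed on fewer observations than the Series length suggests.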

In object arrays, such as a Series of strings, a missing entry is displayed as NaN and can be detected with `isnull`:

In [None]:
string_data = pd.Series(['Kolkata', 'Delhi', np.nan, 'Bangalore'])
string_data

0      Kolkata
1        Delhi
2          NaN
3    Bangalore
dtype: object

In [None]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [None]:
# The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

### NA handling methods
- `dropna` Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
- `fillna` Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
- `isnull` Return boolean values indicating which values are missing/NA.
- `notnull` Negation of isnull

### Filtering Out Missing Data
While you always have the option to filter by hand using `pandas.isnull` and boolean indexing, the `dropna` method is usually more convenient.

In [None]:
data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [None]:
# Dropping the missing data
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [None]:
# This is equivalent to:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With DataFrame objects, things are a bit more complex. You may want to drop rows
or columns that are all NA, or only those containing any NA values.
By default, `dropna` drops **any row containing a missing value**:

In [None]:
 
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [None]:
# Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [None]:
# Add a column that is entirely NA, to demonstrate dropping columns:
data[4] = NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [None]:
# To drop columns that are all NA, pass axis=1:
data.dropna(axis=1, how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
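The "varying thresholds" mentioned in the method list refer to the `thresh` argument of `dropna`. The sketch below rebuilds the same four-row frame under a new name (`frame`, chosen here only for illustration) and keeps the rows that have at least two non-missing values.

```python
import numpy as np
import pandas as pd

# Same values as the example above, rebuilt so the snippet is self-contained
frame = pd.DataFrame([[1., 6.5, 3.],
                      [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.]])

# thresh=2 keeps a row only if it has at least 2 non-NA values
kept = frame.dropna(thresh=2)
print(kept)
```

Rows 1 and 2 are dropped because they hold one and zero observed values respectively, while row 3 survives despite its single NaN.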


### Filling In Missing Data
For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a **constant** replaces **missing values** with that value:

In [None]:
df = pd.DataFrame(np.random.randn(7, 3), columns=['gold', 'silver', 'copper'])
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,gold,silver,copper
0,-0.226786,,
1,-0.373047,,
2,0.959284,,-0.92505
3,-1.122867,,0.99356
4,-0.266494,-0.067346,0.306744
5,0.473559,1.847703,0.290915
6,-2.891568,0.871508,1.095741


In [None]:
# Fill the missing values with zero
df.fillna(0)

Unnamed: 0,gold,silver,copper
0,-0.268859,0.0,0.0
1,-0.075585,0.0,0.0
2,-0.85752,0.0,-0.330685
3,0.030748,0.0,0.160671
4,1.595329,-0.989645,-1.623879
5,0.458323,-1.080568,-0.73787
6,-0.747513,-1.47407,0.224658


In [None]:
# Calling fillna with a dict, you can use a different fill value for each column:
df.fillna({'silver': -1, 'copper': 1})

Unnamed: 0,gold,silver,copper
0,-0.268859,-1.0,1.0
1,-0.075585,-1.0,1.0
2,-0.85752,-1.0,-0.330685
3,0.030748,-1.0,0.160671
4,1.595329,-0.989645,-1.623879
5,0.458323,-1.080568,-0.73787
6,-0.747513,-1.47407,0.224658


In [None]:
# fillna returns a new object and leaves df unchanged by default
df.fillna(0)
print(df)
# Pass inplace=True to modify the existing object instead
df.fillna(0, inplace=True)
print(df)

       gold    silver    copper
0 -0.226786       NaN       NaN
1 -0.373047       NaN       NaN
2  0.959284       NaN -0.925050
3 -1.122867       NaN  0.993560
4 -0.266494 -0.067346  0.306744
5  0.473559  1.847703  0.290915
6 -2.891568  0.871508  1.095741
       gold    silver    copper
0 -0.226786  0.000000  0.000000
1 -0.373047  0.000000  0.000000
2  0.959284  0.000000 -0.925050
3 -1.122867  0.000000  0.993560
4 -0.266494 -0.067346  0.306744
5  0.473559  1.847703  0.290915
6 -2.891568  0.871508  1.095741


The same **interpolation** methods available for reindexing, such as forward fill, can also be used to fill missing values:

In [None]:
# Creating Dataframe
df = pd.DataFrame(np.random.randn(6, 3), columns=['gold', 'silver', 'copper'])
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,gold,silver,copper
0,-1.210169,-1.370134,2.177798
1,1.305728,-1.432419,0.25158
2,-1.225027,,0.181317
3,-0.228302,,0.065387
4,-1.238665,,
5,-0.2244,,


In [None]:
# Fill NA with the forward-fill method
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent)
df.ffill()

Unnamed: 0,gold,silver,copper
0,-1.077324,-0.283236,-0.070706
1,0.509668,0.843724,-0.526803
2,2.224613,0.843724,0.64585
3,-0.306794,0.843724,0.887089
4,-0.316525,0.843724,0.887089
5,0.00238,0.843724,0.887089


In [None]:
# Limit the number of consecutive missing values that are filled
df.ffill(limit=2)

Unnamed: 0,gold,silver,copper
0,-1.077324,-0.283236,-0.070706
1,0.509668,0.843724,-0.526803
2,2.224613,0.843724,0.64585
3,-0.306794,0.843724,0.887089
4,-0.316525,,0.887089
5,0.00238,,0.887089


In [None]:
# You can also pass the mean or median of each column
df.fillna(df.mean())

Unnamed: 0,gold,silver,copper
0,-1.210169,-1.370134,2.177798
1,1.305728,-1.432419,0.25158
2,-1.225027,-1.401277,0.181317
3,-0.228302,-1.401277,0.065387
4,-1.238665,-1.401277,0.669021
5,-0.2244,-1.401277,0.669021
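The cell above fills with column means. As a sketch of the median variant mentioned in the comment, the example below uses a small hand-written frame (`frame` and its values are illustrative assumptions, not the random data above) so that the result can be checked by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'copper' contains one extreme value (100.0)
frame = pd.DataFrame({
    'gold':   [1.0, 2.0, 3.0, 4.0],
    'silver': [10.0, np.nan, 30.0, np.nan],
    'copper': [np.nan, 5.0, 7.0, 100.0],
})

# frame.median() is a Series indexed by column name, so each column
# is filled with its own median
filled_median = frame.fillna(frame.median())
print(filled_median)
```

The median is less sensitive to extreme values than the mean: here the missing `copper` entry becomes 7.0, whereas the mean would give roughly 37.3.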


## B. Data Transformation

### Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [None]:
data = pd.DataFrame({'city': ['kolkata', 'delhi'] * 3 + ['delhi'],'count': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,city,count
0,kolkata,1
1,delhi,1
2,kolkata,2
3,delhi,3
4,kolkata,3
5,delhi,4
6,delhi,4


The DataFrame method `duplicated()` returns a **boolean Series** indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [None]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

The `drop_duplicates()` method returns a DataFrame containing only the rows for which `duplicated()` is False:

In [None]:
data.drop_duplicates()

Unnamed: 0,city,count
0,kolkata,1
1,delhi,1
2,kolkata,2
3,delhi,3
4,kolkata,3
5,delhi,4


In [None]:
data['price'] = np.random.randint(10,100,size=7)
data

Unnamed: 0,city,count,price
0,kolkata,1,11
1,delhi,1,51
2,kolkata,2,78
3,delhi,3,99
4,kolkata,3,12
5,delhi,4,17
6,delhi,4,20


In [None]:
# Drop duplicates based on the 'city' column only
data.drop_duplicates(['city'])

Unnamed: 0,city,count,price
0,kolkata,1,11
1,delhi,1,51
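By default `drop_duplicates` keeps the first row it sees for each key. A short sketch of the `keep` argument, using a hypothetical `sales` frame, shows how to keep the last occurrence instead.

```python
import pandas as pd

# Hypothetical data with repeated cities
sales = pd.DataFrame({'city': ['kolkata', 'delhi', 'kolkata', 'delhi'],
                      'count': [1, 1, 2, 3]})

# Default behaviour: keep the first row for each city
first_per_city = sales.drop_duplicates(subset=['city'])

# keep='last' retains the final row for each city instead
last_per_city = sales.drop_duplicates(subset=['city'], keep='last')

print(first_per_city)
print(last_per_city)
```

Which occurrence to keep depends on the data: if later rows are more recent records, `keep='last'` may be the better choice.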


### Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. 

In [None]:
data = pd.DataFrame({'city':['New York','Delhi','Kolkata','Chicago','Las Vegas'], 
                     'pupulation': np.random.randint(100000,1000000000,size=5)
                     })
data

Unnamed: 0,city,pupulation
0,New York,843448647
1,Delhi,120493031
2,Kolkata,782483942
3,Chicago,127963973
4,Las Vegas,221313099


In [None]:
city_to_country = {'new york':'usa','delhi':'india','kolkata':'india','chicago':'usa','las vegas':'usa'}
city_to_country

{'chicago': 'usa',
 'delhi': 'india',
 'kolkata': 'india',
 'las vegas': 'usa',
 'new york': 'usa'}

In [None]:
# Check the data types
data.dtypes

city          object
pupulation     int64
dtype: object

In [None]:
data['city'] = data['city'].str.lower()
data

Unnamed: 0,city,pupulation
0,new york,843448647
1,delhi,120493031
2,kolkata,782483942
3,chicago,127963973
4,las vegas,221313099


In [None]:
data['country'] = data['city'].map(city_to_country)
data

Unnamed: 0,city,pupulation,country
0,new york,843448647,usa
1,delhi,120493031,india
2,kolkata,782483942,india
3,chicago,127963973,usa
4,las vegas,221313099,usa
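One detail worth knowing about `map` with a dict: keys that are missing from the dict produce NaN rather than an error. The sketch below uses an illustrative `lookup` dict and a city that is deliberately absent from it, and also shows that `map` accepts a function.

```python
import pandas as pd

cities = pd.Series(['new york', 'delhi', 'mumbai'])

# Hypothetical mapping that does not contain 'mumbai'
lookup = {'new york': 'usa', 'delhi': 'india'}

# Unmapped values become NaN
countries = cities.map(lookup)

# map also accepts a function; here a fallback label is supplied instead
countries_with_default = cities.map(lambda city: lookup.get(city, 'unknown'))

print(countries)
print(countries_with_default)
```

This is also why the city names were lower-cased before mapping: `'Delhi'` and `'delhi'` are different keys.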


### Replacing Values
Filling in missing data with the `fillna()` method is a special case of more general value replacement. As you've already seen, `map()` can be used to modify a subset of values in an object, but `replace()` provides a simpler and more flexible way to do so.

In [None]:
data['pupulation'] = data['pupulation'].replace([843448647, 127963973], np.nan)
data

Unnamed: 0,city,pupulation,country
0,new york,,usa
1,delhi,120493031.0,india
2,kolkata,782483942.0,india
3,chicago,,usa
4,las vegas,221313099.0,usa
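`replace` also accepts a dict when different values need different substitutes. The sentinel values below (-999 and -1000) are assumptions made up for this sketch and do not come from the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 and -1000 are sentinel codes
readings = pd.Series([1.0, -999.0, 2.0, -1000.0, 3.0])

# Map each sentinel to its own replacement in a single call
cleaned = readings.replace({-999.0: np.nan, -1000.0: 0.0})
print(cleaned)
```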


### Detecting and Filtering Outliers
Filtering or transforming **outliers** is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

In [None]:
data = pd.DataFrame(np.random.randn(1000, 4),columns=['Aaba','Baba','Caca','Dada'])

# Let's look at the summary statistics to spot outliers
data.describe()

Unnamed: 0,Aaba,Baba,Caca,Dada
count,1000.0,1000.0,1000.0,1000.0
mean,0.025758,-0.028117,-0.0339,0.022928
std,0.997367,1.009365,1.002591,1.019658
min,-3.416037,-3.604211,-3.923166,-3.431423
25%,-0.605312,-0.675626,-0.750218,-0.665727
50%,0.016001,-0.044608,-0.009559,-0.007187
75%,0.663454,0.674266,0.674553,0.709867
max,3.152044,3.349852,3.450299,3.078719


In [None]:
data[np.abs(data['Caca']) > 3]

Unnamed: 0,Aaba,Baba,Caca,Dada
221,-1.613693,1.034349,3.401343,-0.113541
400,-0.566499,0.316156,-3.923166,-0.657563
896,1.102362,0.610834,3.450299,-1.017217


In [None]:
# Detecting outliers in any column of the DataFrame

data[(np.abs(data) > 3).any(axis=1)]  # axis=1: check across the columns of each row

Unnamed: 0,Aaba,Baba,Caca,Dada
128,-3.416037,-0.265455,0.718571,0.016466
189,-0.558286,3.073667,-0.117057,0.669899
221,-1.613693,1.034349,3.401343,-0.113541
334,-1.232503,3.209663,-0.202893,0.039415
373,-1.798933,3.349852,-0.142262,-0.746482
400,-0.566499,0.316156,-3.923166,-0.657563
531,3.043147,-1.027101,0.059559,0.733911
600,-0.165575,-3.604211,0.284244,-1.515084
627,0.87605,-1.524627,0.935809,-3.431423
677,-0.551666,1.031106,0.317305,-3.262707


In [None]:
data[(np.abs(data) > 3).all(axis=1)]

Unnamed: 0,Aaba,Baba,Caca,Dada


In [None]:
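new_row = {'Aaba': 4, 'Baba': 4, 'Caca': -4, 'Dada': -4}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
data[(np.abs(data) > 3).all(axis=1)]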

Unnamed: 0,Aaba,Baba,Caca,Dada
1000,4.0,4.0,-4.0,-4.0
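The cells above only detect outliers. One common way of transforming them, sketched here as an option rather than the notebook's own method, is to cap values to the interval [-3, 3] with `clip`. The frame, its column names, the seed and the planted value 5.0 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 4)),
                     columns=['a', 'b', 'c', 'd'])

# Plant one obvious outlier so there is something to cap
frame.iloc[0, 0] = 5.0

# clip returns a new DataFrame with every value limited to [-3, 3]
capped = frame.clip(lower=-3, upper=3)

print(capped.describe())
```

Capping keeps every row, which can be preferable to dropping rows when the other columns of an outlying row are still informative.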


# Project: Risk of being drawn into online sex work

### Context
This database was used in the paper: Covert online ethnography and machine learning for detecting individuals at risk of being drawn into online sex work. 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28-31 August.

### Content
The database includes data scraped from a European online adult forum. Using covert online ethnography we interviewed a small number of participants and determined their risk to either supply or demand sex services through that forum. This is a great dataset for semi-supervised learning.

### Inspiration
How can we identify individuals at risk of being drawn into online sex work? The spread of online social media enables a greater number of people to become involved in the online sex trade; however, detecting deviant behaviors online is limited by the low availability of data. To overcome this challenge, we combine covert online ethnography with semi-supervised learning using data from a popular European adult forum.

## Importing Data

In [58]:
import pandas as pd
import numpy as np

import warnings; warnings.filterwarnings('ignore')

In [59]:
df = pd.read_csv('/content/online_sex_work.csv', index_col=0)
df = df.iloc[: 28831, :]

df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386.0,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0.0,0.0,0.0,18260,No_risk
14.0,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9.0,0.0,0.0,11778320244376823969273184588431277,No_risk
16721.0,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1.0,1.0,45.0,198052172119802,No_risk
16957.0,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1.0,0.0,1.0,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125.0,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0.0,6.0,8.0,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [60]:
# Understand the Data Types
df.dtypes

Gender                                  object
Age                                     object
Location                                object
Verification                            object
Sexual_orientation                      object
Sexual_polarity                         object
Looking_for                             object
Points_Rank                             object
Last_login                              object
Member_since                            object
Number_of_Comments_in_public_forum      object
Time_spent_chating_H:M                  object
Number_of_advertisments_posted         float64
Number_of_offline_meetings_attended    float64
Profile_pictures                       float64
Friends_ID_list                         object
Risk                                    object
dtype: object

## Data Cleaning


### Change datatype for some features

Several features hold numerical data but are stored as floats or as generic objects. Converting them to integers takes less memory and makes them easier for machine learning models to interpret.

In [61]:
df.index = df.index.astype(int)
df['Number_of_advertisments_posted'] = df['Number_of_advertisments_posted'].astype(int)
df['Number_of_offline_meetings_attended'] = df['Number_of_offline_meetings_attended'].astype(int)
df['Profile_pictures'] = df['Profile_pictures'].astype(int)
df['Friends_ID_list'] = df['Friends_ID_list'].astype(str)
df['Risk'] = df['Risk'].astype(str)

df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [62]:
df.dtypes

Gender                                 object
Age                                    object
Location                               object
Verification                           object
Sexual_orientation                     object
Sexual_polarity                        object
Looking_for                            object
Points_Rank                            object
Last_login                             object
Member_since                           object
Number_of_Comments_in_public_forum     object
Time_spent_chating_H:M                 object
Number_of_advertisments_posted          int64
Number_of_offline_meetings_attended     int64
Profile_pictures                        int64
Friends_ID_list                        object
Risk                                   object
dtype: object

In [63]:
# Uncomment to check the error: a direct cast fails, apparently because some values contain spaces
# df['Number_of_Comments_in_public_forum'] = df['Number_of_Comments_in_public_forum'].astype(int)

In [64]:
df['Number_of_Comments_in_public_forum'] = df['Number_of_Comments_in_public_forum'].str.replace(' ', '').astype(int)
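When it is unclear whether spaces are the only problem in a column, `pd.to_numeric` with `errors='coerce'` can be a more forgiving alternative to `astype(int)`. The sketch below runs on made-up strings (`raw` is hypothetical) rather than on the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: embedded spaces and one unparseable entry
raw = pd.Series(['32', '1 000', ' 25', 'n/a'])

# Remove spaces as literal characters (regex=False), then convert
no_spaces = raw.str.replace(' ', '', regex=False)

# errors='coerce' turns anything that still cannot be parsed into NaN
as_numbers = pd.to_numeric(no_spaces, errors='coerce')
print(as_numbers)
```

Because NaN is a float, the result has a float dtype whenever at least one value could not be parsed; counting the NaN values afterwards shows how many entries need manual attention.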

### Counting the Missing Values

In [65]:
# Count of missing values column wise
df.isnull().sum()

Gender                                   4
Age                                      0
Location                                 1
Verification                             0
Sexual_orientation                       1
Sexual_polarity                          1
Looking_for                            425
Points_Rank                              0
Last_login                               0
Member_since                             0
Number_of_Comments_in_public_forum       0
Time_spent_chating_H:M                   0
Number_of_advertisments_posted           0
Number_of_offline_meetings_attended      0
Profile_pictures                         0
Friends_ID_list                          0
Risk                                     0
dtype: int64

### Convert `Gender` to binary data

In the `Gender` column, we first try to fill missing values using simple rules implemented in the `fill_gender_na` function below (for example, if an entry is homosexual and looking for men, we fill it with `male`). Then, for every entry, we record whether or not it specifies `female`.

In [66]:
def fill_gender_na(row):
    if row['Sexual_orientation'] == 'Homosexual':
        if row['Looking_for'] == 'Men':
            return 'male'
        elif row['Looking_for'] == 'Women':
            return 'female'
    elif row['Sexual_orientation'] == 'Heterosexual':
        if row['Looking_for'] == 'Men':
            return 'female'
        elif row['Looking_for'] == 'Women':
            return 'male'
    return np.nan

In [67]:
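# Fill the missing data
fill_values = df.apply(fill_gender_na, axis=1)
# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['Gender'] = df['Gender'].fillna(fill_values)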
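To see how this rule-based fill behaves, here is a self-contained sketch on a four-row toy frame. It expresses the same four rules as a lookup dict (`rules` and `toy` are names introduced only for this illustration) instead of calling `fill_gender_na`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one known gender, three missing
toy = pd.DataFrame({
    'Gender': ['male', np.nan, np.nan, np.nan],
    'Sexual_orientation': ['Heterosexual', 'Homosexual', 'Heterosexual', 'Bisexual'],
    'Looking_for': ['Women', 'Men', 'Men', 'Women'],
})

# The same four rules as fill_gender_na, written as a lookup table
rules = {
    ('Homosexual', 'Men'): 'male',
    ('Homosexual', 'Women'): 'female',
    ('Heterosexual', 'Men'): 'female',
    ('Heterosexual', 'Women'): 'male',
}

# Look up each (orientation, looking_for) pair; unknown pairs give NaN
pairs = zip(toy['Sexual_orientation'], toy['Looking_for'])
inferred = pd.Series([rules.get(pair, np.nan) for pair in pairs], index=toy.index)

# fillna with a Series fills by index and only touches missing entries
toy['Gender'] = toy['Gender'].fillna(inferred)
print(toy)
```

The last row stays missing because no rule covers its combination, which is why a fallback such as the mode can still be needed after this step.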

In [68]:
# Let's check the missing values
df.isnull().sum()

Gender                                   4
Age                                      0
Location                                 1
Verification                             0
Sexual_orientation                       1
Sexual_polarity                          1
Looking_for                            425
Points_Rank                              0
Last_login                               0
Member_since                             0
Number_of_Comments_in_public_forum       0
Time_spent_chating_H:M                   0
Number_of_advertisments_posted           0
Number_of_offline_meetings_attended      0
Profile_pictures                         0
Friends_ID_list                          0
Risk                                     0
dtype: int64

In [69]:
# Fill the remaining missing values with the most frequent value (the mode)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df.head()

Unnamed: 0_level_0,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
10386,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk


In [70]:
# Let's check the missing values
df.isnull().sum()

Gender                                   0
Age                                      0
Location                                 1
Verification                             0
Sexual_orientation                       1
Sexual_polarity                          1
Looking_for                            425
Points_Rank                              0
Last_login                               0
Member_since                             0
Number_of_Comments_in_public_forum       0
Time_spent_chating_H:M                   0
Number_of_advertisments_posted           0
Number_of_offline_meetings_attended      0
Profile_pictures                         0
Friends_ID_list                          0
Risk                                     0
dtype: int64

### Insert a new binary column named `Female`

In [71]:
df.insert(0, 'Female', df['Gender'] == 'female')
df.head()

Unnamed: 0_level_0,Female,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
10386,False,male,346,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,False,male,322,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,False,male,336,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,False,male,34,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,False,male,395,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk
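The new `Female` column holds booleans. If a model or export format expects 0/1 integers instead, the conversion is a single cast, sketched here on a made-up Series (`gender` is hypothetical).

```python
import pandas as pd

# Hypothetical gender labels
gender = pd.Series(['male', 'female', 'male'])

# The comparison yields a boolean Series
is_female = gender == 'female'

# Casting to int maps False -> 0 and True -> 1
female_flag = is_female.astype(int)
print(female_flag)
```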


### Decimal points in `Age`

We replace all commas (European decimal separator) with periods, while handling some unformatted values.

In [76]:
def comma_replace(obj):
  return obj.replace(",",".")

df['Age'].head().apply(comma_replace)

User_ID
10386    34.6
14       32.2
16721    33.6
16957      34
17125    39.5
Name: Age, dtype: object

In [77]:
# Let's do the same in a single line with a lambda
df['Age'] = df['Age'].apply(lambda obj: obj.replace(',', '.'))
df.head()

Unnamed: 0_level_0,Female,Gender,Age,Location,Verification,Sexual_orientation,Sexual_polarity,Looking_for,Points_Rank,Last_login,Member_since,Number_of_Comments_in_public_forum,Time_spent_chating_H:M,Number_of_advertisments_posted,Number_of_offline_meetings_attended,Profile_pictures,Friends_ID_list,Risk
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
10386,False,male,34.6,A,Non_Verified,Homosexual,Switch,Men,50,before_10_days,17.9.2012,32,0:2,0,0,0,18260,No_risk
14,False,male,32.2,J,Non_Verified,Heterosexual,Dominant,Women,518,before_1_days,1.11.2009,710,3:45,9,0,0,11778320244376823969273184588431277,No_risk
16721,False,male,33.6,K,Non_Verified,Heterosexual,Dominant,Women,150,before_3_days,1.4.2013,25,2:15,1,1,45,198052172119802,No_risk
16957,False,male,34.0,H,Non_Verified,Heterosexual,Dominant,Women,114,before_4_days,8.4.2013,107,359:22,1,0,1,"40847,38183,9507,42259,5807,28118,24848,37170,...",No_risk
17125,False,male,39.5,B,Non_Verified,Heterosexual,Dominant,Women,497,before_5_days,14.4.2013,600,0:21,0,6,8,"1320,35739,34231,19097,20197,18069,12330,43342...",No_risk
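After the separator is fixed, `Age` is still a column of strings. A possible next step, sketched on made-up values (the `'???'` entry is a hypothetical malformed value), is to convert to floats while turning anything unparseable into NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical ages using the European decimal comma, plus one bad entry
ages = pd.Series(['34,6', '32,2', '34', '???'])

# Vectorised string replacement; regex=False treats ',' as a literal
with_periods = ages.str.replace(',', '.', regex=False)

# Convert to float; entries that cannot be parsed become NaN
ages_float = pd.to_numeric(with_periods, errors='coerce')
print(ages_float)
```

The `.str.replace` form does the same job as the `apply` with a lambda above, and it passes missing values through unchanged instead of raising an error on them.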
