# Data Cleansing and Casting Data Types

[Reference - Python Pandas Tutorial (Part 9): Cleaning Data - Casting Datatypes and Handling Missing Values By Corey Schafer](https://www.youtube.com/watch?v=KdmPHEnPJPs&ab_channel=CoreySchafer)

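Key takeaways:

- Replace string placeholders for null values (such as `'NA'`) with `np.nan`
- `Series.unique()`: inspect which distinct values an `object` column contains (a `DataFrame` has no `.unique()` method, so call it on one column at a time)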
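
As a minimal sketch of both takeaways, the snippet below uses a small hypothetical `emails` Series (not part of the notebook's data) to show that a string placeholder is not counted as missing until it is replaced with `np.nan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 'NA' is a string placeholder, None is a real null
emails = pd.Series(['a@example.com', 'NA', None, 'b@example.com'])

# Before cleaning, only the real None counts as missing
missing_before = int(emails.isna().sum())

# After replacing the placeholder, both entries count as missing
cleaned = emails.replace('NA', np.nan)
missing_after = int(cleaned.isna().sum())

print(missing_before, missing_after)
print(cleaned.unique())  # distinct values remaining in the column
```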

In [1]:
import pandas as pd
import numpy as np

In [2]:
DUMMY_DATA = [
    {'Last_Name': 'Smith', 'First_Name': 'John', 'Age': 35, 'Email': 'john.smith@example.com'},
    {'Last_Name': 'Johnson', 'First_Name': 'Emily', 'Age': 28, 'Email': None},
    {'Last_Name': 'Williams', 'First_Name': 'Michael', 'Age': 42, 'Email': 'm.williams@example.com'},
    {'Last_Name': 'Brown', 'First_Name': 'Jessica', 'Age': None, 'Email': 'jess.brown@example.com'},
    {'Last_Name': 'Jones', 'First_Name': 'David', 'Age': 51, 'Email': 'david.jones@example.com'},
    {'Last_Name': 'Garcia', 'First_Name': 'Ashley', 'Age': 31, 'Email': 'ashley.g@example.com'},
    {'Last_Name': None, 'First_Name': 'Chris', 'Age': 45, 'Email': 'NA'},
    {'Last_Name': 'NA', 'First_Name': 'Missing', 'Age': None, 'Email': 'Missing'},
    {'Last_Name': 'Rodriguez', 'First_Name': 'James', 'Age': 38, 'Email': None},
    {'Last_Name': 'Martinez', 'First_Name': 'Jennifer', 'Age': 22, 'Email': 'jennifer.m@example.com'}
]
COLUMNS = ['Last_Name', 'First_Name', 'Age', 'Email']

df = pd.DataFrame(
    data=DUMMY_DATA,
    columns=COLUMNS,
)
df.head()

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com


In [3]:
# The simplest way to remove rows that contain null values
df.dropna()

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
2,Williams,Michael,42.0,m.williams@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [4]:
df.dropna(axis='index', how='any')

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
2,Williams,Michael,42.0,m.williams@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [5]:
# With how='all', a row (index) is dropped only if every value in it is null
df.dropna(axis='index', how='all')

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
6,,Chris,45.0,NA
7,NA,Missing,,Missing
8,Rodriguez,James,38.0,
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [6]:
# Use subset to restrict the null check to specific columns
df.dropna(
    axis='index',
    subset=['Email'],
    how='all'
)

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
6,,Chris,45.0,NA
7,NA,Missing,,Missing
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [7]:
df.dropna(
    axis='index',
    subset=['Last_Name', 'Email'],
    how='all'  # here, a row is dropped only when every column in subset is null
)

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
6,,Chris,45.0,NA
7,NA,Missing,,Missing
8,Rodriguez,James,38.0,
9,Martinez,Jennifer,22.0,jennifer.m@example.com
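
To make the difference between `how='any'` and `how='all'` easier to compare, here is a small sketch on a hypothetical `people` frame (not the notebook's data). It also shows `thresh`, a `dropna` parameter the notebook does not use, which keeps rows that have at least the given number of non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so each option keeps a different set of rows
people = pd.DataFrame({
    'Last_Name': ['Smith', np.nan, np.nan, 'Brown'],
    'Email': ['john@example.com', np.nan, 'x@example.com', np.nan],
    'Age': [35, 45, 22, np.nan],
})

# Drop a row if ANY of the subset columns is null
any_null = people.dropna(subset=['Last_Name', 'Email'], how='any')

# Drop a row only if ALL of the subset columns are null
all_null = people.dropna(subset=['Last_Name', 'Email'], how='all')

# Keep rows with at least 2 non-null values across all columns
at_least_two = people.dropna(thresh=2)

print(list(any_null.index), list(all_null.index), list(at_least_two.index))
```

Note that recent pandas versions do not accept `how` and `thresh` in the same call, so pick one of the two.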


In [8]:
# Replace the 'NA' strings in the original DataFrame with np.nan
df.replace('NA', np.nan, inplace=True)
df

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
6,,Chris,45.0,
7,,Missing,,Missing
8,Rodriguez,James,38.0,
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [9]:
# This time dropna drops index 6 (Chris), because the 'NA' string has been replaced with a real null
df.dropna(
    axis='index',
    subset=['Last_Name', 'Email'],
    how='all'
)

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
7,,Missing,,Missing
8,Rodriguez,James,38.0,
9,Martinez,Jennifer,22.0,jennifer.m@example.com


In [10]:
# isna() -> Boolean mask
df.isna()

Unnamed: 0,Last_Name,First_Name,Age,Email
0,False,False,False,False
1,False,False,False,True
2,False,False,False,False
3,False,False,True,False
4,False,False,False,False
5,False,False,False,False
6,True,False,False,True
7,True,False,True,False
8,False,False,False,True
9,False,False,False,False
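
The Boolean mask is mostly useful as an input to other operations. A minimal sketch, assuming a hypothetical three-row `df_demo` frame, counts nulls per column and selects the rows that contain at least one null:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
df_demo = pd.DataFrame({
    'Last_Name': ['Smith', np.nan, 'Brown'],
    'Age': [35.0, 45.0, np.nan],
    'Email': ['john@example.com', np.nan, 'jess@example.com'],
})

mask = df_demo.isna()

# True counts as 1, so summing the mask gives the null count per column
nulls_per_column = mask.sum()

# Rows where any column is null
rows_with_nulls = df_demo[mask.any(axis='columns')]

print(nulls_per_column)
print(rows_with_nulls)
```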


In [11]:
# fillna() -> fills every NA value in the DataFrame in one call
# df.fillna('MISSING')
df.fillna(0)

Unnamed: 0,Last_Name,First_Name,Age,Email
0,Smith,John,35.0,john.smith@example.com
1,Johnson,Emily,28.0,0
2,Williams,Michael,42.0,m.williams@example.com
3,Brown,Jessica,0.0,jess.brown@example.com
4,Jones,David,51.0,david.jones@example.com
5,Garcia,Ashley,31.0,ashley.g@example.com
6,0,Chris,45.0,0
7,0,Missing,0.0,Missing
8,Rodriguez,James,38.0,0
9,Martinez,Jennifer,22.0,jennifer.m@example.com
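
Filling every column with `0` puts a numeric zero into text columns such as `Email`. One possible alternative, sketched here on a hypothetical `demo` frame, is to pass a dict to `fillna` so that each column gets its own fill value. The choice of the column mean and the `'unknown'` label is an assumption for illustration, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
demo = pd.DataFrame({
    'Age': [35.0, np.nan, 45.0],
    'Email': ['a@example.com', np.nan, np.nan],
})

# Map each column name to the value used for its missing entries
filled = demo.fillna({
    'Age': demo['Age'].mean(),  # mean() skips NaN by default
    'Email': 'unknown',
})

print(filled)
```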


In [12]:
# Get the data type of every column
df.dtypes

Last_Name      object
First_Name     object
Age           float64
Email          object
dtype: object

Note: in a pandas `DataFrame`, the `object` dtype indicates a column of strings or mixed data types.

In [13]:
# np.nan is actually a float
type(np.nan)

float

So if a column (a `Series`) holds both integers and `np.nan`, cast it with `astype(float)`. Casting to `int` fails because NaN has no integer representation.
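
The sketch below illustrates this on a hypothetical `ages` Series: the `int` cast raises, the `float` cast works, and pandas' nullable `'Int64'` dtype (not used in the notebook) is one option if integer values are preferred.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: integers plus one missing value
ages = pd.Series([35, 28, np.nan])
print(ages.dtype)  # pandas already stores this as float64 because of the NaN

# Casting to a plain int fails because NaN cannot be represented
int_cast_failed = False
try:
    ages.astype(int)
except ValueError as exc:
    int_cast_failed = True
    print('astype(int) failed:', exc)

# Casting to float always works
as_float = ages.astype(float)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
as_nullable_int = ages.astype('Int64')
print(as_nullable_int)
```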

In [14]:
df['Age'].value_counts(dropna=False)

Age
NaN     2
35.0    1
28.0    1
42.0    1
51.0    1
31.0    1
45.0    1
38.0    1
22.0    1
Name: count, dtype: int64

In [15]:
df['Age'] = df['Age'].astype(float)

In [16]:
df.dtypes

Last_Name      object
First_Name     object
Age           float64
Email          object
dtype: object

In [17]:
df['Age'].mean()

np.float64(36.5)

## Survey

In [18]:
# Define the list of strings to treat as NA
na_vals = ['NA', 'Missing']

In [19]:
# Convert those strings to NaN while reading the CSV file
df = pd.read_csv(
    '../dataset/survey_results_public.csv',
    index_col='ResponseId',
    na_values=na_vals
)
schema_df = pd.read_csv(
    '../dataset/survey_results_schema.csv',
    index_col='qname'
)
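
To see what `na_values` changes without needing the survey files, here is a small sketch that reads a hypothetical in-memory CSV (the rows are made up). `read_csv` already treats `'NA'` as missing by default, so in this example the custom list only has to add `'Missing'`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = (
    "ResponseId,YearsCode,Country\n"
    "1,Missing,Norway\n"
    "2,20,NA\n"
    "3,4,Canada\n"
)

# Default parsing: 'NA' becomes NaN, but 'Missing' stays a string
raw = pd.read_csv(io.StringIO(csv_text), index_col='ResponseId')

# With na_values, 'Missing' is also parsed as NaN
clean = pd.read_csv(
    io.StringIO(csv_text),
    index_col='ResponseId',
    na_values=['Missing'],
)

print(raw['YearsCode'].tolist())
print(clean['YearsCode'].tolist())
```

Because the placeholder is removed at read time, `YearsCode` in `clean` is parsed as a numeric column straight away.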

In [20]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [21]:
df.head()

ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,TechDoc,YearsCode,YearsCodePro,DevType,OrgSize,PurchaseInfluence,BuyNewTool,BuildvsBuy,TechEndorse,Country,Currency,CompTotal,LanguageHaveWorkedWith,LanguageWantToWorkWith,LanguageAdmired,DatabaseHaveWorkedWith,DatabaseWantToWorkWith,DatabaseAdmired,PlatformHaveWorkedWith,PlatformWantToWorkWith,PlatformAdmired,WebframeHaveWorkedWith,WebframeWantToWorkWith,WebframeAdmired,EmbeddedHaveWorkedWith,EmbeddedWantToWorkWith,EmbeddedAdmired,MiscTechHaveWorkedWith,MiscTechWantToWorkWith,MiscTechAdmired,ToolsTechHaveWorkedWith,ToolsTechWantToWorkWith,ToolsTechAdmired,NEWCollabToolsHaveWorkedWith,NEWCollabToolsWantToWorkWith,NEWCollabToolsAdmired,OpSysPersonal use,OpSysProfessional use,OfficeStackAsyncHaveWorkedWith,OfficeStackAsyncWantToWorkWith,OfficeStackAsyncAdmired,OfficeStackSyncHaveWorkedWith,OfficeStackSyncWantToWorkWith,OfficeStackSyncAdmired,AISearchDevHaveWorkedWith,AISearchDevWantToWorkWith,AISearchDevAdmired,NEWSOSites,SOVisitFreq,SOAccount,SOPartFreq,SOHow,SOComm,AISelect,AISent,AIBen,AIAcc,AIComplex,AIToolCurrently Using,AIToolInterested in Using,AIToolNot interested in Using,AINextMuch more integrated,AINextNo change,AINextMore integrated,AINextLess integrated,AINextMuch less integrated,AIThreat,AIEthics,AIChallenges,TBranch,ICorPM,WorkExp,Knowledge_1,Knowledge_2,Knowledge_3,Knowledge_4,Knowledge_5,Knowledge_6,Knowledge_7,Knowledge_8,Knowledge_9,Frequency_1,Frequency_2,Frequency_3,TimeSearching,TimeAnswering,Frustration,ProfessionalTech,ProfessionalCloud,ProfessionalQuestion,Industry,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,,,,,,,,,,United States of America,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,I have never visited Stack Overflow or the Sta...,,,,,,Yes,Very favorable,Increase productivity,,,,,,,,,,,,,,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,20.0,17.0,"Developer, full-stack",,,,,,United Kingdom of Great Britain and Northern I...,,,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Bash/Shell (all shells);Go;HTML/CSS;Java;JavaS...,Dynamodb;MongoDB;PostgreSQL,PostgreSQL,PostgreSQL,Amazon Web Services (AWS);Heroku;Netlify,Amazon Web Services (AWS);Heroku;Netlify,Amazon Web Services (AWS);Heroku;Netlify,Express;Next.js;Node.js;React,Express;Htmx;Node.js;React;Remix,Express;Node.js;React,,,,,,,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,Docker;Homebrew;Kubernetes;npm;Vite;Webpack,PyCharm;Visual Studio Code;WebStorm,PyCharm;Visual Studio Code;WebStorm,PyCharm;Visual Studio Code;WebStorm,MacOS;Windows,MacOS,,,,Microsoft Teams;Slack,Slack,Slack,,,,Stack Overflow for Teams (private knowledge sh...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Finding reliabl...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,Yes,Individual contributor,17.0,Agree,Disagree,Agree,Agree,Agree,Neither agree nor disagree,Disagree,Agree,Agree,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,
3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,API document(s) and/or SDK document(s);User gu...,37.0,27.0,Developer Experience,,,,,,United Kingdom of Great Britain and Northern I...,,,C#,C#,C#,Firebase Realtime Database,Firebase Realtime Database,Firebase Realtime Database,Google Cloud,Google Cloud,Google Cloud,ASP.NET CORE,ASP.NET CORE,ASP.NET CORE,Rasberry Pi,Rasberry Pi,Rasberry Pi,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,.NET (5+) ;.NET Framework (1.0 - 4.8);.NET MAUI,MSBuild,MSBuild,MSBuild,Visual Studio,Visual Studio,Visual Studio,Windows,Windows,,,,Google Chat;Google Meet;Microsoft Teams;Zoom,Google Chat;Google Meet;Zoom,Google Chat;Google Meet;Zoom,,,,Stack Overflow;Stack Exchange;Stack Overflow B...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Finding reliabl...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Appropriate in length,Easy,,
4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,,4.0,,"Developer, full-stack",,,,,,Canada,,,C;C++;HTML/CSS;Java;JavaScript;PHP;PowerShell;...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,HTML/CSS;Java;JavaScript;PowerShell;Python;SQL...,MongoDB;MySQL;PostgreSQL;SQLite,MongoDB;MySQL;PostgreSQL,MongoDB;MySQL;PostgreSQL,Amazon Web Services (AWS);Fly.io;Heroku,Amazon Web Services (AWS);Vercel,Amazon Web Services (AWS),jQuery;Next.js;Node.js;React;WordPress,jQuery;Next.js;Node.js;React,jQuery;Next.js;Node.js;React,Rasberry Pi,,,NumPy;Pandas;Ruff;TensorFlow,,,Docker;npm;Pip,Docker;Kubernetes;npm,Docker;npm,,,,,,,,,,,,,,,Stack Overflow,Daily or almost daily,No,,Quickly finding code solutions,"No, not really",Yes,Very favorable,Increase productivity;Greater efficiency;Impro...,Somewhat trust,Bad at handling complex tasks,Learning about a codebase;Project planning;Wri...,Testing code;Committing and reviewing code;Pre...,,Learning about a codebase;Project planning;Wri...,,,,,No,Circulating misinformation or disinformation;M...,Don’t trust the output or answers,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Too long,Easy,,
5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,API document(s) and/or SDK document(s);User gu...,9.0,,"Developer, full-stack",,,,,,Norway,,,C++;HTML/CSS;JavaScript;Lua;Python;Rust,C++;HTML/CSS;JavaScript;Lua;Python,C++;HTML/CSS;JavaScript;Lua;Python,PostgreSQL;SQLite,PostgreSQL;SQLite,PostgreSQL;SQLite,,,,,,,CMake;Cargo;Rasberry Pi,CMake;Rasberry Pi,CMake;Rasberry Pi,,,,APT;Make;npm,APT;Make,APT;Make,Vim,Vim,Vim,Other (please specify):,,GitHub Discussions;Markdown File;Obsidian;Stac...,GitHub Discussions;Markdown File;Obsidian,GitHub Discussions;Markdown File;Obsidian,Discord;Whatsapp,Discord;Whatsapp,Discord;Whatsapp,,,,Stack Overflow for Teams (private knowledge sh...,Multiple times per day,Yes,Multiple times per day,Quickly finding code solutions;Engage with com...,"Yes, definitely","No, and I don't plan to",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Too short,Easy,,


In [22]:
df['YearsCode'].head(10)

ResponseId
1     NaN
2      20
3      37
4       4
5       9
6      10
7       7
8       1
9      20
10     15
Name: YearsCode, dtype: object

In [23]:
# Use .unique() to check which distinct values the column contains; it returns an ndarray
df['YearsCode'].unique()

array([nan, '20', '37', '4', '9', '10', '7', '1', '15', '30', '31', '6',
       '12', '22', '5', '36', '25', '44', '24', '18', '3', '8',
       'More than 50 years', '11', '29', '40', '39', '2', '42', '34',
       '19', '35', '16', '33', '13', '23', '14', '28', '17', '21', '43',
       '46', '26', '32', '41', '45', '27', '38', '50', '48', '47',
       'Less than 1 year', '49'], dtype=object)

In [24]:
# Replace the string categories with numbers.
# Assigning the result back to the column avoids a chained inplace call.
df['YearsCode'] = df['YearsCode'].replace({
    'Less than 1 year': 0,
    'More than 50 years': 51,
})

Note: the chained form `df['YearsCode'].replace('Less than 1 year', 0, inplace=True)` triggers a pandas warning that the behavior will change in pandas 3.0. The intermediate object `df['YearsCode']` always behaves as a copy, so the inplace method will never modify the original `DataFrame`. The warning suggests `df.method({col: value}, inplace=True)` or `df[col] = df[col].method(value)` instead; the cell above uses the second form.


In [25]:
# Once the strings are replaced, cast the column to a numeric dtype
df['YearsCode'] = df['YearsCode'].astype(float)

In [26]:
df['YearsCode'].mean()

np.float64(14.197497870350265)

In [27]:
df['YearsCode'].median()

np.float64(11.0)
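
As a possible alternative to `astype(float)`, `pd.to_numeric` with `errors='coerce'` converts anything it cannot parse into NaN instead of raising. The sketch below uses a hypothetical `years` Series rather than the survey data, and the mapping to `'0'` and `'51'` mirrors the notebook's choice. It shows why the text categories should still be mapped first: coercing alone silently discards them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the YearsCode column
years = pd.Series(['20', 'Less than 1 year', np.nan, 'More than 50 years', '4'])

# Coercing directly turns both text categories into NaN
coerced_only = pd.to_numeric(years, errors='coerce')

# Mapping the categories first keeps that information
mapped = years.replace({'Less than 1 year': '0', 'More than 50 years': '51'})
numeric = pd.to_numeric(mapped, errors='coerce')

print(int(coerced_only.isna().sum()))   # missing values after coercing alone
print(numeric.mean(), numeric.median())
```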