# Modifying DataFrames

## Casting Types

Casting converts a column or Series to a different data type, which can improve performance, reduce memory usage, or make the data compatible with operations that expect a specific type.


### `astype()`
The `astype()` method converts a Series or DataFrame to the specified data type.  

#### Examples:
- Convert to `float`:
  ```python
  titanic["age"] = titanic["age"].astype("float")
  ```
- Convert to `category` (useful for columns with repeated values to save memory):
  ```python
  titanic["sex"] = titanic["sex"].astype("category")
  ```
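
As a rough illustration of the two casts above, the sketch below uses a small hypothetical DataFrame (not the real Titanic file) and compares the memory used by a repetitive string column before and after casting it to `category`. The exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic columns used above
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"] * 250,
    "age": ["29", "0.9167", "2", "30"] * 250,
})

# Strings that all parse as numbers can be cast directly
df["age"] = df["age"].astype("float")

# Compare memory before and after casting to category
before = df["sex"].memory_usage(deep=True)
df["sex"] = df["sex"].astype("category")
after = df["sex"].memory_usage(deep=True)

print(df.dtypes)
print(f"sex column: {before} bytes -> {after} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it grows with the amount of repetition in the column.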

### `pd.to_numeric()`
The `pd.to_numeric()` function converts a Series to a numeric type (int or float).

The `errors` parameter controls what happens when a value cannot be parsed:

- **errors='raise'** (default): Raises an error for invalid values.
- **errors='coerce'**: Converts invalid values to NaN.
- **errors='ignore'**: Returns the input unchanged if any value is invalid. This option is deprecated since pandas 2.2.

#### Example:
- Convert to numeric, coercing invalid values to NaN:
    ```python
    titanic["age"] = pd.to_numeric(titanic["age"], errors="coerce")
    ```
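
To make the difference between `errors="raise"` and `errors="coerce"` concrete, here is a minimal sketch on a hypothetical Series that contains the same `"?"` placeholder found in the Titanic data.

```python
import pandas as pd

# Hypothetical ages containing the "?" placeholder seen in the Titanic data
ages = pd.Series(["29", "0.9167", "?", "30"])

# errors="raise" (the default) stops at the first value it cannot parse
try:
    pd.to_numeric(ages)
except ValueError as exc:
    print(f"raise: {exc}")

# errors="coerce" turns unparseable values into NaN instead
clean = pd.to_numeric(ages, errors="coerce")
print(clean)
```

Coercing is convenient, but it silently discards whatever could not be parsed, so it can be worth counting the resulting NaN values afterwards.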

Use `astype()` when every value is already valid for the target type, and `pd.to_numeric()` with `errors="coerce"` when the column may contain entries that cannot be parsed as numbers.

In [2]:
import pandas as pd
houses = pd.read_csv('../data/kc_house_data.csv')
titanic = pd.read_csv('../data/titanic.csv')
netflix = pd.read_csv('../data/netflix_titles.csv', sep="|", index_col=0)
btc = pd.read_csv('../data/coin_Bitcoin.csv')
countries = pd.read_csv('../data/world-happiness-report-2021.csv')

In [5]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [10]:
# Assign the result back instead of calling replace(..., inplace=True) on a selected column
titanic["age"] = titanic["age"].replace(["?"], [None])
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,?,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11,?,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.55,C22 C26,S,?,135,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.55,C22 C26,S,?,?,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,?,C,?,328,?
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,?,C,?,?,?
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,?,C,?,304,?
1307,3,0,"Zakarian, Mr. Ortin",male,27,0,0,2670,7.225,?,C,?,?,?


In [14]:
titanic["age"] = titanic["age"].astype("float")
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1309 non-null   object 
 9   cabin      1309 non-null   object 
 10  embarked   1309 non-null   object 
 11  boat       1309 non-null   object 
 12  body       1309 non-null   object 
 13  home.dest  1309 non-null   object 
dtypes: float64(1), int64(4), object(9)
memory usage: 143.3+ KB


In [19]:
titanic["sex"] = titanic["sex"].astype("category")
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   int64   
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1309 non-null   object  
 9   cabin      1309 non-null   object  
 10  embarked   1309 non-null   object  
 11  boat       1309 non-null   object  
 12  body       1309 non-null   object  
 13  home.dest  1309 non-null   object  
dtypes: category(1), float64(1), int64(4), object(8)
memory usage: 134.5+ KB


In [25]:
titanic = pd.read_csv("../data/titanic.csv")
titanic["age"] = pd.to_numeric(titanic["age"], errors="coerce")

In [26]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1309 non-null   object 
 9   cabin      1309 non-null   object 
 10  embarked   1309 non-null   object 
 11  boat       1309 non-null   object 
 12  body       1309 non-null   object 
 13  home.dest  1309 non-null   object 
dtypes: float64(1), int64(4), object(9)
memory usage: 143.3+ KB


## NA Values

NA (Not Available) values represent missing data in pandas. There are several methods to handle these values effectively.

### `isna()`
The `isna()` method checks for missing values and returns `True` for NA values.  
- **For a DataFrame**: Returns a mask (boolean DataFrame) indicating missing values.
- **For a Series**: Returns a boolean Series.

#### Example:
```python
stats.isna()
```
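
A boolean mask is rarely the end goal. One common follow-up, sketched here on a small hypothetical frame rather than the real `stats` data, is to sum the mask to count missing values per column, or to use it to select the affected rows.

```python
import pandas as pd

# Hypothetical frame mirroring the shape of the game stats data
df = pd.DataFrame({
    "name": ["bob", "jessie", None],
    "points": [22.0, 10.0, None],
    "assists": [5.0, None, None],
})

mask = df.isna()                  # boolean DataFrame, same shape as df
missing_per_column = mask.sum()   # True counts as 1, so this counts NAs
rows_with_any_na = df[mask.any(axis=1)]

print(missing_per_column)
print(rows_with_any_na)
```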

### `dropna()`
The `dropna()` method removes missing values.

- **For a Series**: Returns a Series with missing values removed.
- **For a DataFrame**: Removes rows or columns containing missing values.

#### Parameters:
- **inplace**: If `True`, modifies the original object and returns `None` instead of a new object.
- **how**: Determines whether to drop rows/columns with:
    - **"any"**: Drops if any value is missing (default).
    - **"all"**: Drops only if all values are missing.
- **subset**: A list of column labels to check for NA values when dropping rows.
- **axis**: `0` (default) drops rows; `1` drops columns.

#### Example:
```python
# Drop missing values from a Series
stats["assists"].dropna()

# Drop rows with missing values in a DataFrame
stats.dropna(inplace=True)

# Drop columns with missing values
stats.dropna(axis=1)
```
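
The effect of `how` and `subset` is easiest to see side by side. The following sketch uses a hypothetical four-row frame and leaves the original untouched by not passing `inplace`.

```python
import pandas as pd

# Hypothetical data: row 0 is complete, row 3 is entirely missing
df = pd.DataFrame({
    "name": ["bob", "jessie", "stu", None],
    "league": ["nba", None, "euroleague", None],
    "points": [22.0, 10.0, None, None],
})

any_missing = df.dropna()                  # how="any": keeps only row 0
all_missing = df.dropna(how="all")         # drops only the fully empty row 3
by_subset = df.dropna(subset=["league"])   # checks the league column only

print(len(any_missing), len(all_missing), len(by_subset))
```

Because each call returns a new DataFrame, the three results can be compared without reloading the data.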

### `fillna()`

The `fillna()` method replaces NA values with specified values.

**Value parameter**: The replacement value(s) can be:
- A **single value** (applied to all missing values).
- A **dictionary** mapping column names to replacement values.
- A **Series** whose index supplies the replacement values: matched against column names when filling a DataFrame, or against row labels when filling a Series.

#### Parameters:
- **inplace**: If `True`, replaces missing values in the original object and returns `None`.

#### Example:
```python
# Fill all missing values with a single value
stats.fillna(value=0, inplace=True)

# Fill missing values with a dictionary
stats.fillna({"column1": 0, "column2": 5}, inplace=True)

# Fill missing values with a Series
fill_values = pd.Series([1, 2], index=["column1", "column2"])
stats.fillna(fill_values, inplace=True)
```
These methods provide flexible ways to identify, remove, or fill missing data.
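
As a small illustration of per-column filling, the sketch below applies a dictionary and a Series to a hypothetical frame. It assumes the non-`inplace` form, so each call returns a new DataFrame and the original keeps its missing values.

```python
import pandas as pd

# Hypothetical frame with gaps in every column
df = pd.DataFrame({
    "league": ["nba", None, "aba"],
    "points": [22.0, None, 9.0],
    "assists": [5.0, None, None],
})

# Dictionary: one replacement value per column; unlisted columns are left alone
filled = df.fillna({"league": "amateur", "points": 0})

# Series passed to DataFrame.fillna: its index is matched against column names
per_column = pd.Series({"points": 0, "assists": 1})
filled_from_series = df.fillna(per_column)

print(filled)
print(filled_from_series)
```

Columns that are not named in the dictionary or the Series index keep their missing values, which makes it possible to choose a sensible default for each column separately.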


In [28]:
stats = pd.read_csv("../data/game_stats.csv")
stats.isna()

Unnamed: 0,name,league,points,assists,rebounds
0,False,False,False,False,False
1,False,True,False,True,False
2,False,False,True,True,True
3,False,False,False,True,False
4,False,True,False,True,True
5,False,False,False,False,False
6,True,True,True,True,True


In [30]:
stats["assists"].dropna()

0    5.0
5    8.0
Name: assists, dtype: float64

In [37]:
stats.dropna(subset=["league", "points"])

Unnamed: 0,name,league,points,assists,rebounds
0,bob,nba,22.0,5.0,10.0
3,jackson,aba,9.0,,2.0
5,steph,nba,49.0,8.0,10.0


In [38]:
stats.fillna(0)

Unnamed: 0,name,league,points,assists,rebounds
0,bob,nba,22.0,5.0,10.0
1,jessie,0,10.0,0.0,2.0
2,stu,euroleague,0.0,0.0,0.0
3,jackson,aba,9.0,0.0,2.0
4,timothee,0,8.0,0.0,0.0
5,steph,nba,49.0,8.0,10.0
6,0,0,0.0,0.0,0.0


In [39]:
stats["league"].fillna("amateur")

0           nba
1       amateur
2    euroleague
3           aba
4       amateur
5           nba
6       amateur
Name: league, dtype: object