# STAT1100 Data Communication and Modelling

## Deleting Data

In [1]:
import pandas as pd
print(pd.__version__)

1.2.3


The `dummy` data set is a small data set containing erroneous personal data that has been created for the purpose of this exercise. Since the data set is small enough we can view it in its entirety.

In [2]:
dummy = pd.read_csv("data/dummy person data.csv")
dummy.info()
dummy

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      10 non-null     float64
 1   age     11 non-null     float64
 2   sex     9 non-null      float64
 3   gender  10 non-null     object 
dtypes: float64(3), object(1)
memory usage: 512.0+ bytes


Unnamed: 0,ID,age,sex,gender
0,1.0,22.0,1.0,Male
1,2.0,26.0,1.0,Male
2,3.0,37.0,0.0,Female
3,,20.0,1.0,Male
4,4.0,56.0,0.0,Female
5,5.0,-1.0,,
6,6.0,79.0,99.0,Unknown
7,7.0,64.0,,Not Available
8,8.0,-35.0,1.0,Male
9,,,,


In addition to reporting the variable names and data types, the [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method reports, for each variable, the number of observations that are not `NA` values, that is, not `None` or `NaN`. Although the values of *ID*, *age* and *sex* are all whole numbers, pandas reports these variables as having a `float64` data type. This is because each of these variables contains missing values: by default, pandas uses `NaN` to represent a missing value, and `NaN` is a float. The variables *ID*, *age* and *sex* can be cast to a special [nullable integer data type](https://pandas.pydata.org/docs/user_guide/integer_na.html#integer-na), but this feature is still experimental in pandas so we won't use it here.
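As an aside, the cast to a nullable integer type mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical series `ids` (not the `dummy` data), and it is not used in the cleaning steps that follow. The dtype name `"Int64"`, with a capital "I", is the pandas nullable integer type.

```python
import pandas as pd

# Hypothetical stand-in for a column such as ID: whole numbers plus one missing value
ids = pd.Series([1.0, 2.0, None, 4.0])
print(ids.dtype)  # float64, because the missing value is stored as NaN

# Cast to the nullable integer type; the missing value is kept as missing
ids_int = ids.astype("Int64")
print(ids_int.dtype)
print(ids_int.isna().sum())  # number of missing values after the cast
```

The cast only succeeds when every non-missing value is a whole number, which is the case for *ID*, *age* and *sex* in this data set.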

The data set contains the following errors:
1. Values for variable *age* outside of the range [0, 120].
2. Values for variable *sex* not in {0, 1}.
3. Values for variable *gender* include "Unknown" and "Not Available".
4. Observation 10 has `gender=Male` but `sex=0` which seems contrary to the coding.
5. Observation 3 is missing the identifier variable *ID*.
6. Observation 9 is a blank row.
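Checks like the ones listed above can also be expressed as boolean masks, which is a convenient way to count problems before changing anything. The sketch below uses a small hypothetical data frame `example` with the same column names as `dummy`; the mask names are illustrative only.

```python
import pandas as pd

# Hypothetical four-row frame that mimics the kinds of errors described above
example = pd.DataFrame({
    "ID": [1.0, None, 3.0, None],
    "age": [22.0, 20.0, -35.0, None],
    "sex": [1.0, 1.0, 99.0, None],
    "gender": ["Male", "Male", "Unknown", None],
})

# Comparisons with a missing value evaluate to False, so only real values are flagged
age_out_of_range = (example["age"] < 0) | (example["age"] > 120)

# Flag values that are present but not one of the allowed codes
sex_invalid = example["sex"].notna() & ~example["sex"].isin([0, 1])
gender_invalid = example["gender"].notna() & ~example["gender"].isin(["Female", "Male"])

# Rows with no identifier, and rows where every column is missing
missing_id = example["ID"].isna()
blank_row = example.isna().all(axis="columns")

# Count how many rows each check flags
report = pd.DataFrame({
    "age_out_of_range": age_out_of_range,
    "sex_invalid": sex_invalid,
    "gender_invalid": gender_invalid,
    "missing_id": missing_id,
    "blank_row": blank_row,
}).sum()
print(report)
```

Keeping each check as a named mask makes it easy to inspect the offending rows, for example with `example[age_out_of_range]`, before deciding how to treat them.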

### Removing values

There are many ways to clean this data. Below is one way to fix errors 1-4.

In [3]:
dummy.loc[dummy["age"]<0, "age"] = None
dummy.loc[dummy["age"]>120, "age"] = None
dummy.loc[~dummy["sex"].isin([0,1]), "sex"] = None
dummy.loc[~dummy["gender"].isin(["Female", "Male"]), "gender"] = None
dummy.loc[(dummy["sex"]==0) & (dummy["gender"]=="Male"), "gender"] = \
    "Female"
dummy

Unnamed: 0,ID,age,sex,gender
0,1.0,22.0,1.0,Male
1,2.0,26.0,1.0,Male
2,3.0,37.0,0.0,Female
3,,20.0,1.0,Male
4,4.0,56.0,0.0,Female
5,5.0,,,
6,6.0,79.0,,
7,7.0,64.0,,
8,8.0,,1.0,Male
9,,,,


Note that numerical variables always use `NaN` to represent missing values, whereas variables with an `object` data type, such as *gender*, can hold either `None` or `NaN`, depending on which value was assigned.
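The difference between the two kinds of column can be seen in a minimal sketch. The series `numeric` and `labels` below are hypothetical and are built directly, rather than taken from `dummy`.

```python
import numpy as np
import pandas as pd

# A numeric series: None is converted to NaN and the dtype is float64
numeric = pd.Series([None, 2.0, 3.0])
print(numeric.dtype)

# An object series can hold None and NaN side by side
labels = pd.Series(["Male", None, np.nan], dtype=object)
print(labels.dtype)

# isna() treats both representations as missing
print(numeric.isna().tolist())
print(labels.isna().tolist())
```

Because `isna()` recognises both `None` and `NaN`, the distinction rarely matters when counting or filtering missing values.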

### Removing rows

Two rows do not have a value for *ID*.

In [4]:
dummy[dummy["ID"].isna()]

Unnamed: 0,ID,age,sex,gender
3,,20.0,1.0,Male
9,,,,


The latter obviously needs to be removed, but it is unclear whether observation 3 should be removed, as this row does contain data in some of its columns.

To remove these rows based on the missing values for *ID* we can use the [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method with parameters `axis='index'` and `subset=["ID"]`. To change this data frame, rather than have a new data frame created, we need to set the parameter `inplace=True`. It is important to remember that this is destructive: we are potentially "throwing away" data if observation 3 does represent a participant in our sample who is male and 20 years of age. It may be safer to retain this observation. However, depending on the type of analysis done on the data set later, this observation may be dropped anyway if values for *ID* are required.

In [5]:
dummy.dropna(axis='index', subset=["ID"], inplace=True)
dummy

Unnamed: 0,ID,age,sex,gender
0,1.0,22.0,1.0,Male
1,2.0,26.0,1.0,Male
2,3.0,37.0,0.0,Female
4,4.0,56.0,0.0,Female
5,5.0,,,
6,6.0,79.0,,
7,7.0,64.0,,
8,8.0,,1.0,Male
10,9.0,27.0,0.0,Female
11,10.0,,0.0,Female
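If the destructive behaviour is a concern, a less risky variant is to omit `inplace=True` so that `dropna()` returns a new data frame and the original is left intact. The sketch below assumes a small hypothetical data frame `people`; the `how="all"` option shown removes only rows in which every value is missing.

```python
import pandas as pd

# Hypothetical frame: row 1 has data but no ID, row 3 is completely blank
people = pd.DataFrame({
    "ID": [1.0, None, 2.0, None],
    "age": [22.0, 20.0, 37.0, None],
})

# Without inplace=True a new data frame is returned and `people` is unchanged
with_id = people.dropna(axis="index", subset=["ID"])

# how="all" drops only the rows where every column is missing
not_blank = people.dropna(axis="index", how="all")

print(len(people), len(with_id), len(not_blank))
```

Here `not_blank` keeps the row that has an age but no *ID*, which corresponds to the more cautious treatment of observation 3 discussed above.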


### Removing variables

The variables *sex* and *gender* now contain equivalent data. In this situation we may decide to remove the unnecessary *gender* variable and only keep the numeric variable *sex*. We can do this using the [drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method with the parameter `columns=["gender"]`.

In [6]:
dummy.drop(columns=["gender"], inplace=True)

We could have also used the [drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method with the parameter `index=[3,9]` to drop the rows we removed above.
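A minimal sketch of dropping rows by index label is given below, again on a hypothetical data frame `people`. Rather than typing the labels by hand, it looks them up from the rows with a missing *ID*; this lookup is an illustrative choice, not something the exercise requires.

```python
import pandas as pd

# Hypothetical frame: the rows labelled 1 and 3 have no ID
people = pd.DataFrame({
    "ID": [1.0, None, 2.0, None],
    "age": [22.0, 20.0, 37.0, None],
})

# Find the index labels of the rows with a missing ID
labels = people.index[people["ID"].isna()]

# drop() returns a new data frame without those rows
trimmed = people.drop(index=labels)
print(trimmed.index.tolist())
```

Note that `drop()` works with index labels, not row positions, so the labels must refer to the index as it currently stands.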

Finally, it may be convenient to reset the index for the data frame using the [reset_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) method. To prevent the old index from being inserted into the data frame as a new column, set the parameter `drop=True`.

In [7]:
dummy.reset_index(drop=True, inplace=True)
dummy

Unnamed: 0,ID,age,sex
0,1.0,22.0,1.0
1,2.0,26.0,1.0
2,3.0,37.0,0.0
3,4.0,56.0,0.0
4,5.0,,
5,6.0,79.0,
6,7.0,64.0,
7,8.0,,1.0
8,9.0,27.0,0.0
9,10.0,,0.0
