<a href="https://colab.research.google.com/github/axel-sirota/manage-data-pandas/blob/main/module5/ManageDataPandas_Mod5Demo1_ApplyInvalid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using apply on invalid data


## Prep

In the last series of demos we worked with duplicated rows. Now we turn to another troublesome issue in datasets: invalid data. For this we will use another dataset of sensor measurements.

In [None]:
%%writefile get_data.sh
if [ ! -f measurements_invalid.csv ]; then
  wget -O measurements_invalid.csv https://raw.githubusercontent.com/axel-sirota/manage-data-pandas/main/data/measurements_invalid.csv
fi

Writing get_data.sh


In [None]:
!bash get_data.sh

--2023-04-25 13:54:17--  https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/measurements_invalid.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 391424 (382K) [text/plain]
Saving to: ‘measurements_invalid.csv’


2023-04-25 13:54:18 (10.4 MB/s) - ‘measurements_invalid.csv’ saved [391424/391424]



In [None]:
import numpy as np
import pandas as pd

measurements =  pd.read_csv('measurements_invalid.csv')
measurements

Unnamed: 0,sensor,date,measurement
0,temperature,2023-04-25 13:53:31.759460,40
1,humidity,2023-04-25 13:53:31.759470,26
2,temperature,2023-04-25 13:53:31.759473,196
3,temperature,2023-04-25 13:53:31.759475,2
4,temperature,2023-04-25 13:53:31.759477,10
...,...,...,...
9995,weight,2023-04-25 13:53:31.780708,60
9996,temperature,2023-04-25 13:53:31.780710,134
9997,weight,2023-04-25 13:53:31.780712,33
9998,humidity,2023-04-25 13:53:31.780713,77


## Detecting invalid data

Dealing with invalid data is one of the most difficult parts of data wrangling. You need to understand whether the error is tractable or not. One easy way to detect it is to check whether you can run the `mean` function on each column.

In [None]:
measurements.set_index('sensor').mean()

  measurements.set_index('sensor').mean()


Series([], dtype: float64)

In [None]:
measurements[measurements.sensor == 'temperature'].mean()

  measurements[measurements.sensor == 'temperature'].mean()


Series([], dtype: float64)

Notice the errors? They are being camouflaged! Instead of failing, `mean` returns an empty Series. Let's run an `apply` and try to see where it fails.

In [None]:
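running_sum = 0

def func(x):
  # Add each value to a running total; raises TypeError on a non-numeric value
  global running_sum
  running_sum += x

# Apply to the column that is supposed to be numeric
measurements.measurement.apply(func)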

TypeError: ignored

There we go! We have some strings in here. We can also confirm this with the `info` method, which reports every column as `object`.

In [None]:
measurements.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sensor       10000 non-null  object
 1   date         10000 non-null  object
 2   measurement  10000 non-null  object
dtypes: object(3)
memory usage: 234.5+ KB
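The running-sum function above tells us that something is wrong, but not which values are wrong. A minimal sketch of an `apply`-based check that flags each value individually is shown below. The toy Series and the helper name `is_number` are assumptions made for illustration; they are not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy data standing in for the `measurement` column
toy = pd.Series(['40', '26', 'error', '196', 'n/a'])

def is_number(value):
    """Return True if `value` can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

# apply calls is_number once per element and returns a boolean Series
valid_mask = toy.apply(is_number)

# Keep only the entries that failed to parse
invalid_values = toy[~valid_mask]
print(invalid_values.tolist())
```

Because the function catches the exception instead of letting it propagate, `apply` runs over the whole column and the resulting mask can be used to inspect or filter the offending rows.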


The trick to detect invalid data, or remove it, is the `pd.to_numeric` function, which tries to convert supposedly numeric fields into numeric types. The second trick is to set `errors='coerce'` so that values that fail to convert become NaN, and we already know how to handle those.

In [None]:
pd.to_numeric(measurements['measurement'], errors='coerce')

0        40.0
1        26.0
2       196.0
3         2.0
4        10.0
        ...  
9995     60.0
9996    134.0
9997     33.0
9998     77.0
9999     26.0
Name: measurement, Length: 10000, dtype: float64
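To make the effect of `errors='coerce'` easier to see, here is a small sketch on a hypothetical three-value Series (the values are made up for illustration). With the default `errors='raise'` the conversion stops with a `ValueError`, while `coerce` replaces the unparseable entry with NaN.

```python
import pandas as pd

# Hypothetical values: two numeric strings and one invalid entry
toy = pd.Series(['40', 'broken', '196'])

try:
    pd.to_numeric(toy)  # the default is errors='raise'
    raised = False
except ValueError:
    raised = True

# With coerce, the invalid entry becomes NaN instead of raising
coerced = pd.to_numeric(toy, errors='coerce')

print(raised)
print(coerced.isna().tolist())
print(coerced.dtype)
```

NaN is a floating-point value, so a coerced column that contains one comes back as `float64`, which is the dtype shown in the output above.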

In [None]:
pd.to_numeric(measurements['measurement'], errors='coerce').isna().sum()

935
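Before dropping anything, it can be worth looking at what the invalid entries actually are, since a recurring placeholder may be fixable rather than disposable. The sketch below does this on a hypothetical toy frame; the column values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    'sensor': ['temperature', 'humidity', 'weight', 'weight', 'humidity'],
    'measurement': ['40', 'error', '26', 'error', 'n/a'],
})

# True wherever the value cannot be converted to a number
invalid_mask = pd.to_numeric(toy['measurement'], errors='coerce').isna()

# Count how often each distinct invalid value appears
counts = toy.loc[invalid_mask, 'measurement'].value_counts()
print(counts)
```

The same mask can be reused on the real `measurements` frame to see whether the invalid values follow a pattern.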

We can see we have 935 troublesome values. Let's remove them.

In [None]:
measurements_filtered = measurements[~pd.to_numeric(measurements['measurement'], errors='coerce').isna()]

Now we can convert those measurements to numeric!

In [None]:
measurements_filtered['measurement'] = measurements_filtered.measurement.astype('int32')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  measurements_filtered['measurement'] = measurements_filtered.measurement.astype('int32')
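The message above is pandas' `SettingWithCopyWarning`: the filtered frame was produced by boolean indexing, so pandas cannot tell whether the assignment is meant to affect the original frame. One common way to avoid it, sketched here on a hypothetical toy frame, is to take an explicit `.copy()` of the filtered rows before assigning to them.

```python
import pandas as pd

# Hypothetical toy frame with one invalid measurement
toy = pd.DataFrame({
    'sensor': ['temperature', 'humidity', 'weight'],
    'measurement': ['40', 'error', '26'],
})

# Keep only rows whose measurement parses as a number
valid_mask = pd.to_numeric(toy['measurement'], errors='coerce').notna()

# .copy() makes the filtered frame independent of the original
toy_filtered = toy[valid_mask].copy()
toy_filtered['measurement'] = toy_filtered['measurement'].astype('int32')

print(toy_filtered.dtypes)
```

The original frame is left untouched, and the conversion applies only to the independent copy.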


In [None]:
measurements_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9065 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   sensor       9065 non-null   object
 1   date         9065 non-null   object
 2   measurement  9065 non-null   int32 
dtypes: int32(1), object(2)
memory usage: 247.9+ KB


In [None]:
measurements_filtered.measurement.describe()

count    9065.000000
mean      787.018974
std      1507.348461
min       -10.000000
25%        48.000000
50%        97.000000
75%       193.000000
max      5991.000000
Name: measurement, dtype: float64

There you go!