7.1 - Handling Missing Data
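float64 = 64-bit floating-point numbers (numbers that can have decimals).
NaN = Not a Number; pandas uses it to mark missing floating-point values.
NA = Not Available, the term R uses for missing data.
Python's built-in None is also treated as NA by pandas.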

In [1]:
import numpy as np
import pandas as pd

In [2]:
#pandas uses the floating-point value NaN as a sentinel value: when present, it indicates a missing (or null) value:
float_data = pd.Series([1.2, -3.5, np.nan, 0])

float_data 

# dtype: float64

0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [3]:
#isna returns a Boolean Series with True where values are null
float_data.isna()
#dtype: bool

0    False
1    False
2     True
3    False
dtype: bool

In [11]:
#When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.
#The built-in Python None value is also treated as NA:
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [10]:
float_data = pd.Series([1, 2, None], dtype='float64')
float_data
print("-------")
float_data.isna()

-------


0    False
1    False
2     True
dtype: bool
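
A small sketch of why isna is the right check for missing values: NaN compares unequal to everything, including itself, so an equality test cannot find it. The Series s and the mask names below are illustrative only and do not come from the cells above.

```python
import numpy as np
import pandas as pd

# None is stored as NaN once the Series holds floats
s = pd.Series([1.2, None, np.nan, 0.0])

# NaN is never equal to anything, so == cannot detect it
eq_mask = s == np.nan      # False everywhere
na_mask = s.isna()         # True at positions 1 and 2

# Counting missing values is a common first step when auditing data
n_missing = int(na_mask.sum())

print(eq_mask.tolist())
print(na_mask.tolist())
print(n_missing)
```

The same count works per column on a DataFrame with isna().sum(), which is one way to start the "analysis on the missing data itself" mentioned above.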

Filtering Out Missing Data
    - You can always filter missing data by hand using pandas.isna and Boolean indexing, but dropna can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [13]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()


0    1.0
2    3.5
4    7.0
dtype: float64

In [14]:
#Equivalently, use notna with Boolean indexing to keep only the non-NA values
data[data.notna()]


0    1.0
2    3.5
4    7.0
dtype: float64

You may want to drop rows or columns that are all NA, or only those rows or columns containing any NAs at all. dropna by default drops any row containing a missing value:

In [24]:
#Example
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

data
#by default, dropna drops any row containing a missing value
data.dropna()

     0    1    2
0  1.0  6.5  3.0


In [25]:
#Passing how="all" will drop only rows that are all NA:
data.dropna(how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


Keep in mind that these functions return new objects by default and do not modify the contents of the original object.
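
A minimal sketch of what "return new objects" means in practice. The names frame and cleaned are made up for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan]])

cleaned = frame.dropna()   # a new DataFrame; frame itself is untouched

print(frame.shape)     # (3, 3): the original still has every row
print(cleaned.shape)   # (1, 3)

# To keep a result, assign it to a name (here we rebind frame)
frame = frame.dropna(how="all")
print(frame.shape)     # (2, 3)
```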

To drop columns in the same way, pass axis="columns":

In [27]:
data.dropna(axis="columns", how="all")

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


Suppose you want to keep only rows containing at most a certain number of missing observations. You can indicate this with the thresh argument, which sets the minimum number of non-NA values a row must have to be kept:

In [32]:
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan

df.iloc[:2, 2] = np.nan

df

          0         1         2
0  0.942239       NaN       NaN
1 -0.235298       NaN       NaN
2  0.626740       NaN  0.573266
3 -1.755751       NaN -0.272018
4  0.689196 -0.696573 -0.290748
5 -1.076741 -0.078781 -0.308893
6  0.273372 -0.314592  0.033229


In [33]:
# thresh=2 keeps only the rows that have at least 2 non-NA values:
df.dropna(thresh=2)

          0         1         2
2  0.626740       NaN  0.573266
3 -1.755751       NaN -0.272018
4  0.689196 -0.696573 -0.290748
5 -1.076741 -0.078781 -0.308893
6  0.273372 -0.314592  0.033229
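
To make the meaning of thresh concrete, the sketch below compares dropna(thresh=2) with the same filter written by hand. The DataFrame toy and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                    "b": [np.nan, np.nan, 5.0],
                    "c": [np.nan, 4.0, 6.0]})

# Keep rows with at least 2 non-NA values
kept = toy.dropna(thresh=2)

# The same filter by hand: count non-NA values in each row
non_na_per_row = toy.notna().sum(axis=1)
by_hand = toy[non_na_per_row >= 2]

print(kept.index.tolist())    # row 0 has only one non-NA value, so it is dropped
print(kept.equals(by_hand))
```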


Filling In Missing Data
    - Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways.
    - For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [34]:
df.fillna(0)

          0         1         2
0  0.942239  0.000000  0.000000
1 -0.235298  0.000000  0.000000
2  0.626740  0.000000  0.573266
3 -1.755751  0.000000 -0.272018
4  0.689196 -0.696573 -0.290748
5 -1.076741 -0.078781 -0.308893
6  0.273372 -0.314592  0.033229


In [35]:
#Calling fillna with a dictionary lets you use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0  0.942239  0.500000  0.000000
1 -0.235298  0.500000  0.000000
2  0.626740  0.500000  0.573266
3 -1.755751  0.500000 -0.272018
4  0.689196 -0.696573 -0.290748
5 -1.076741 -0.078781 -0.308893
6  0.273372 -0.314592  0.033229


In [36]:
#The same interpolation methods available for reindexing (see ???) can be used with fillna:
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

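df.fillna(method="ffill")
#fillna(method="ffill") forward-fills: each NaN is replaced by the last valid value above it in the same column
#(recent pandas versions deprecate the method argument; df.ffill() does the same thing)
#Note: the table shown below is df before filling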


          0         1         2
0  0.417959  0.345362 -1.083000
1 -0.344011 -1.219936 -0.179674
2 -0.145396       NaN  1.160757
3 -0.509460       NaN  0.884452
4  0.451508       NaN       NaN
5 -1.363385       NaN       NaN


In [None]:
df.fillna(method="ffill", limit=2)
#same as above, but limit=2 fills at most 2 consecutive NaN values in each column,
#so here rows 4 and 5 of column 1 stay NaN
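
Because the random data above changes on every run, here is a deterministic sketch of forward filling with and without a limit. It uses DataFrame.ffill rather than fillna(method="ffill"); the DataFrame toy and its columns are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan],
                    "y": [10.0, 20.0, np.nan, np.nan]})

# Carry the last valid value down each column
filled = toy.ffill()

# Fill at most 2 consecutive NaNs; the third NaN in "x" is left alone
filled_limited = toy.ffill(limit=2)

print(filled)
print(filled_limited)
```

Forward filling runs down the rows of each column, so a NaN in the first row of a column has nothing above it and stays NaN.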

With fillna you can do lots of other things such as simple data imputation using the median or mean statistics:

In [37]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])

data.fillna(data.mean())


0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
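
The same idea extends to a DataFrame. In this sketch (toy, col_means and imputed are illustrative names), passing the Series of column means to fillna fills each column with its own mean.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [np.nan, 4.0, 8.0]})

# mean() skips NaN by default and returns one value per column
col_means = toy.mean()

# fillna matches the Series index against the column labels
imputed = toy.fillna(col_means)

print(col_means)
print(imputed)
```

Swapping toy.mean() for toy.median() gives median imputation instead.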

7.2 - Data Transformation
    - Removing Duplicates

In [39]:
#Duplicate rows can appear in a DataFrame; here is an example (rows 5 and 6 are identical):
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate (its column values are exactly equal to those in an earlier row) or not:

In [40]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]:
#Relatedly, drop_duplicates returns a DataFrame containing only the rows where duplicated is False:
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
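
The relationship between duplicated and drop_duplicates can be checked directly. This sketch rebuilds the same two-column data under the illustrative name dup_frame and filters it both ways.

```python
import pandas as pd

dup_frame = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                          "k2": [1, 1, 2, 3, 3, 4, 4]})

mask = dup_frame.duplicated()          # True only for the repeated last row

by_mask = dup_frame[~mask]             # keep rows that are NOT duplicates
by_method = dup_frame.drop_duplicates()

print(by_mask.equals(by_method))
```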


Both methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates based only on the "k1" column:

In [44]:
data["v1"] = range(7) #adding a column data["v1"]
data

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6


In [45]:
data.drop_duplicates(subset=["k1"])

    k1  k2  v1
0  one   1   0
1  two   1   1


duplicated and drop_duplicates by default keep the first observed value combination. Passing keep="last" will return the last one:

In [46]:
data.drop_duplicates(["k1", "k2"], keep="last")

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
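
A compact sketch of the keep options on a three-row frame (toy is an invented name). Besides "first" and "last", keep=False drops every row that belongs to a duplicated group.

```python
import pandas as pd

toy = pd.DataFrame({"k1": ["one", "two", "two"],
                    "k2": [1, 4, 4],
                    "v1": [0, 5, 6]})

keys = ["k1", "k2"]
first = toy.drop_duplicates(keys)               # default keep="first"
last = toy.drop_duplicates(keys, keep="last")
neither = toy.drop_duplicates(keys, keep=False)  # drop all members of a duplicated group

print(first.index.tolist())
print(last.index.tolist())
print(neither.index.tolist())
```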


Transforming Data Using a Function or Mapping
   - For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [48]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                        "pastrami", "corned beef", "bacon",
                        "pastrami", "honey ham", "nova lox"],
                        "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     pastrami     6.0
4  corned beef     7.5
5        bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [50]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

The map method on a Series accepts a function or dictionary-like object containing a mapping to do the transformation of values:

In [52]:
data["animal"] = data["food"].map(meat_to_animal)
data

          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
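
Two things are worth knowing about map that the example above does not show: values missing from a plain dictionary become NaN, and a function can be passed instead of a dictionary. The sketch below uses a shortened mapping and a made-up Series called foods; the "unknown" fallback is an assumption for illustration.

```python
import pandas as pd

foods = pd.Series(["bacon", "Pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Dictionary lookup is exact: "Pastrami" (capitalized) and "tofu" are not keys
direct = foods.map(meat_to_animal)

# A function can normalize case and supply a fallback value
via_function = foods.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

print(direct)
print(via_function)
```

Checking direct.isna() after a dictionary map is a quick way to find categories the mapping forgot.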
