## Data Manipulation with Pandas

In data science problems, you often need to work with tables that mix data of different types, for instance:

|Subject|Age|Sex|Height (m)|
|--------|---|---|-------|
|Subject 1| 45 | M | 1.85|
|Subject 2| 25 | M | 1.65|
|Subject 3| 60 | F | 1.50|
|Subject 4| 30 | F | 1.62|

This data contains strings, integers, categorical values (M or F), and floats. As discussed, NumPy arrays are not well suited to manipulating this data, because every element of an array must share a single data type.
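A quick way to see the single-type constraint is to put mixed values into a NumPy array. The following is a minimal sketch; the variable name `row` is chosen only for illustration.

```python
import numpy as np

# Mixing an integer, a string, and a float forces NumPy to pick one common dtype
row = np.array([45, 'M', 1.85])

# kind 'U' means every element was converted to a Unicode string
print(row.dtype.kind)
print(row)
```

After the conversion, numeric operations such as computing a mean no longer work directly on the array.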

Pandas builds on NumPy and allows us to organize and manipulate tabular data. Two important Pandas objects help you organize your data:
- `Pandas Series`
- `Pandas DataFrames`
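As an illustration, the table above could be stored in a DataFrame, where each column keeps its own data type. This is only a sketch: the name `subjects` and the shortened column names are assumptions made for this example.

```python
import pandas as pd

# One column per variable; each column can have a different dtype
subjects = pd.DataFrame(
    {'Age': [45, 25, 60, 30],
     'Sex': ['M', 'M', 'F', 'F'],
     'Height': [1.85, 1.65, 1.50, 1.62]},
    index=['Subject 1', 'Subject 2', 'Subject 3', 'Subject 4'])

# Show the dtype that pandas inferred for each column
print(subjects.dtypes)
```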


Pandas is likely the most popular library for tabular data manipulation, but there are more modern alternatives that offer multiple improvements. Some of these alternatives include:
- Polars -- Better for manipulation of large datasets
- cuDF -- DataFrames on the GPU
- Dask -- Parallel processing

## Series

In [2]:
import pandas as pd  # `pd` is the conventional alias for pandas
import numpy as np  # `np` is the conventional alias for NumPy
np.set_printoptions(legacy='1.25')  # optional: prints NumPy scalars without their type information

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A pandas `Series` wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes. The values are a NumPy array.

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [6]:
data[1]

0.5

In [7]:
data[0:3]

0    0.25
1    0.50
2    0.75
dtype: float64

You can also define your own index:

In [9]:
data = pd.Series(data = [0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

and use the index labels to access the values:

In [12]:
data['a']

0.25
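Once a Series has a custom index, it helps to be explicit about whether you are selecting by label or by position. The sketch below uses the `.loc` (label-based) and `.iloc` (position-based) accessors on the same Series; the variable names are chosen for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data.loc['b']       # label-based access
by_position = data.iloc[1]     # position-based access (second element)

# Slicing by label includes the end label; slicing by position excludes the end
label_slice = data.loc['a':'c']    # rows 'a', 'b', and 'c'
position_slice = data.iloc[0:2]    # positions 0 and 1 only

print(label_slice)
print(position_slice)
```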

## DataFrames 

A DataFrame can be seen as a generalized two-dimensional array or as a specialized dictionary.

In [13]:
#Series can be created directly from a dictionary 

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [14]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

A DataFrame can be constructed as a two-dimensional object containing all this information:

In [15]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

| |population|area|
|---|---|---|
|California|38332521|423967|
|Texas|26448193|695662|
|New York|19651127|141297|
|Florida|19552860|170312|
|Illinois|12882135|149995|


The DataFrame now has an `index`, which identifies each row, and `columns`, an index that labels the columns of the table.

In [16]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [17]:
states.columns

Index(['population', 'area'], dtype='object')

In [18]:
# you can use a column name to access a column of the DataFrame
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

In [19]:
states.loc['California']

population    38332521
area            423967
Name: California, dtype: int64
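Row and column selection can also be combined in a single step. The sketch below rebuilds a smaller version of `states` so that it runs on its own, and uses `.loc[row, column]` for label-based access and `.iloc` for position-based access; the variable names are illustrative.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127],
     'area': [423967, 695662, 141297]},
    index=['California', 'Texas', 'New York'])

texas_area = states.loc['Texas', 'area']    # one cell, by row and column label
first_row = states.iloc[0]                  # first row, by position
# Several rows and a list of columns return a smaller DataFrame
subset = states.loc[['California', 'New York'], ['area']]

print(texas_area)
print(subset)
```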

In [20]:
# you can also add new columns to an existing DataFrame
states['density'] = states['population'] / states['area']
states['largeDensity'] = states['density'] > 100

In [21]:
states

| |population|area|density|largeDensity|
|---|---|---|---|---|
|California|38332521|423967|90.413926|False|
|Texas|26448193|695662|38.01874|False|
|New York|19651127|141297|139.076746|True|
|Florida|19552860|170312|114.806121|True|
|Illinois|12882135|149995|85.883763|False|
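A boolean condition like the one used for the density column can also serve as a row filter. The following sketch rebuilds the table so that it runs on its own and keeps only the states whose density exceeds 100; the name `dense` is chosen for illustration.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
states['density'] = states['population'] / states['area']

# Indexing with a boolean Series keeps only the rows where it is True
dense = states[states['density'] > 100]
print(dense)
```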


## Not A Number or Missing Data

Often (very often) datasets contain Not A Number (NaN) or missing data. Pandas offers some elegant ways to deal with this situation that can save time and effort.

In [25]:
#create a DataFrame with missing data 

df = pd.DataFrame(data = [[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

| |0|1|2|
|---|---|---|---|
|0|1.0|NaN|2|
|1|2.0|3.0|5|
|2|NaN|4.0|6|
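Before dropping or filling anything, it is often useful to check where the missing values are. A minimal sketch using `isna()`, with variable names chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

mask = df.isna()                        # True wherever a value is missing
missing_per_column = mask.sum()         # count of missing values in each column
total_missing = int(mask.sum().sum())   # count over the whole DataFrame

print(missing_per_column)
print(total_missing)
```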


In [26]:
# by default, dropna() drops every row that contains at least one missing value
df.dropna()

| |0|1|2|
|---|---|---|---|
|1|2.0|3.0|5|


In [27]:
# alternatively, you can drop every column that contains at least one missing value
df.dropna(axis='columns')

| |2|
|---|---|
|0|2|
|1|5|
|2|6|


In [28]:
# thresh sets the minimum number of non-missing values a row or column needs in order to be kept
# (here every column has at least 2, so nothing is dropped)
df.dropna(axis='columns', thresh = 2)

| |0|1|2|
|---|---|---|---|
|0|1.0|NaN|2|
|1|2.0|3.0|5|
|2|NaN|4.0|6|
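With `thresh=2` every column in this example survives, because each one has at least two non-missing values. The sketch below raises the threshold to 3 to show a case where rows or columns are actually dropped; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Keep only the columns with at least 3 non-missing values
kept_columns = df.dropna(axis='columns', thresh=3)

# Keep only the rows with at least 3 non-missing values
kept_rows = df.dropna(axis='index', thresh=3)

print(kept_columns)
print(kept_rows)
```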


In [32]:
# You can also fill in the missing data with different methods

# for example, with a number
df.fillna(0)

| |0|1|2|
|---|---|---|---|
|0|1.0|0.0|2|
|1|2.0|3.0|5|
|2|0.0|4.0|6|


In [33]:
# or with the previous valid value in the same column (forward fill)
df.ffill()

| |0|1|2|
|---|---|---|---|
|0|1.0|NaN|2|
|1|2.0|3.0|5|
|2|2.0|4.0|6|


In [34]:
# or with the next valid value in the same column (backward fill)
df.bfill()

| |0|1|2|
|---|---|---|---|
|0|1.0|3.0|2|
|1|2.0|3.0|5|
|2|NaN|4.0|6|
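Neither fill direction removes every gap on its own: a forward fill cannot fill a value missing in the first row, and a backward fill cannot fill one missing in the last row. One possible approach, sketched here, is to chain the two calls; whether that is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Forward fill first, then backward fill whatever is still missing
filled = df.ffill().bfill()
print(filled)
```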


In [35]:
#or the mean by column
df.fillna(df.mean(numeric_only=True))

| |0|1|2|
|---|---|---|---|
|0|1.0|3.5|2|
|1|2.0|3.0|5|
|2|1.5|4.0|6|


In [36]:
# or the mean by row: transpose, fill each (transposed) column with its row mean, then transpose back
df.T.fillna(df.mean(numeric_only=True,axis=1), axis=0).T

| |0|1|2|
|---|---|---|---|
|0|1.0|1.5|2.0|
|1|2.0|3.0|5.0|
|2|5.0|4.0|6.0|
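An alternative way to fill by row mean, shown here only as a sketch, is to use `apply` with `axis=1` so that each row is filled with its own mean. It gives the same result as the transpose-based version for this example, and some readers may find it easier to follow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

# Each row is passed to the lambda as a Series; missing entries get that row's mean
row_filled = df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(row_filled)
```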


## Concat

DataFrames can be concatenated to create larger DataFrames.


In [37]:
# we will use this helper class to show several DataFrames side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

# we will use this function to create DataFrames for demonstrating different concepts
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)


In [38]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

| |A|B|
|---|---|---|
|1|A1|B1|
|2|A2|B2|

| |A|B|
|---|---|---|
|3|A3|B3|
|4|A4|B4|

| |A|B|
|---|---|---|
|1|A1|B1|
|2|A2|B2|
|3|A3|B3|
|4|A4|B4|


By default, the concatenation takes place row-wise (i.e., `axis=0`), but you can specify a different axis.

In [39]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|

| |C|D|
|---|---|---|
|0|C0|D0|
|1|C1|D1|

| |A|B|C|D|
|---|---|---|---|---|
|0|A0|B0|C0|D0|
|1|A1|B1|C1|D1|


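In the previous examples, we chose the row indices by hand so that they did not overlap, which is not very convenient. When the inputs share index labels, `pd.concat` keeps the duplicated labels.

We can ask pandas to discard the original indices and build a new one with `ignore_index=True`: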

In [42]:
df1 = make_df('AB', [0,1])
df2 = make_df('AB', [0,1])
display('df1', 'df2', 'pd.concat([df1, df2], ignore_index=True)')

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|
|2|A0|B0|
|3|A1|B1|
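By default, `pd.concat` keeps the original row labels, so concatenating DataFrames that share labels produces a duplicated index. The sketch below shows this and also uses the `verify_integrity` flag, which makes pandas raise an error instead of silently keeping duplicates. The DataFrames are built directly rather than with `make_df` so that the example runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])

# The labels 0 and 1 each appear twice in the result
combined = pd.concat([df1, df2])
print(combined.index.is_unique)

# verify_integrity=True turns duplicated labels into an error
try:
    pd.concat([df1, df2], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# ignore_index=True replaces the labels with a fresh 0..n-1 range
reset = pd.concat([df1, df2], ignore_index=True)
print(reset.index)
```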


Or we can use the `keys` argument to label each input, which creates a hierarchical index (a `MultiIndex`):

In [43]:
df1 = make_df('AB', [0,1])
df2 = make_df('AB', [0,1])
display('df1', 'df2', "pd.concat([df1, df2], keys=['Group_1', 'Group_2'])")

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|

| |A|B|
|---|---|---|
|0|A0|B0|
|1|A1|B1|

| | |A|B|
|---|---|---|---|
|Group_1|0|A0|B0|
|Group_1|1|A1|B1|
|Group_2|0|A0|B0|
|Group_2|1|A1|B1|
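The outer level of a hierarchical index can then be used to recover each original group. A minimal sketch, again building the DataFrames directly so that it runs on its own, with illustrative variable names:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])

combined = pd.concat([df1, df2], keys=['Group_1', 'Group_2'])

# Selecting an outer label returns the rows of that group with their inner index
group_1 = combined.loc['Group_1']

# A tuple (outer label, inner label) addresses a single row
cell = combined.loc[('Group_2', 1), 'A']

print(group_1)
print(cell)
```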


You can also concatenate DataFrames whose columns only partly match. Pandas fills the entries that have no data with NaN:

In [44]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

| |A|B|C|
|---|---|---|---|
|1|A1|B1|C1|
|2|A2|B2|C2|

| |B|C|D|
|---|---|---|---|
|3|B3|C3|D3|
|4|B4|C4|D4|

| |A|B|C|D|
|---|---|---|---|---|
|1|A1|B1|C1|NaN|
|2|A2|B2|C2|NaN|
|3|NaN|B3|C3|D3|
|4|NaN|B4|C4|D4|


or you can force pandas to merge the columns