## Data Manipulation with Pandas

In data science problems, you often need to work with tables that mix data of different types, for instance:

|Subject|Age|Sex|Height|
|--------|---|---|-------|
|Subject 1| 45 | M | 1.85|
|Subject 2| 25 | M | 1.65|
|Subject 3| 60 | F | 1.50|
|Subject 4| 30 | F | 1.62|

This data contains strings, integers, categorical labels (M or F, which could be encoded as booleans), and floats. As discussed, NumPy arrays are not well suited to this data, because a NumPy array stores elements of a single data type.
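
To see the single-type constraint in practice, the following minimal sketch builds a NumPy array from one row of the table above. The variable name `row` is only illustrative.

```python
import numpy as np

# One row of the table: a string, an integer, a string, and a float
row = np.array(["Subject 1", 45, "M", 1.85])

# NumPy picks one common type for all elements; here everything becomes a string
print(row.dtype.kind)  # 'U' stands for Unicode string
print(row)
```

Because the age and the height are now stored as text, numerical operations on them would require converting them back first.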

Pandas builds on NumPy and allows us to organize and manipulate tabular data. Two important Pandas objects help you organize your data:
- pandas `Series`
- pandas `DataFrame`


Pandas is likely the most popular library for tabular data manipulation, but there are more modern alternatives that offer multiple improvements. Some of these alternatives include
- Polars -- Better for manipulation of large datasets
- cuDF -- DataFrames on the GPU
- Dask -- Parallel processing

## Series

In [None]:
import numpy as np   # pandas builds on NumPy, so we import it as well
import pandas as pd  # `pd` is the conventional alias
np.set_printoptions(legacy='1.25')  # optional: prints NumPy scalars as plain numbers (requires NumPy 2.0 or later)

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

A pandas `Series` wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes. The values are a NumPy array.

In [None]:
data.values

In [None]:
data.index

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
data[1]

In [None]:
data[0:3]

You can also create your own index:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

and use the index labels to access the values:

In [None]:
data['b']
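
With a custom index, it is worth distinguishing selection by label from selection by position. The sketch below uses the `.loc` and `.iloc` accessors on the same `Series`; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

# .loc selects by label; a label slice includes both endpoints
by_label = data.loc["a":"c"]

# .iloc selects by integer position; the stop position is excluded
by_position = data.iloc[0:2]

print(by_label)
print(by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key refers to a label or a position.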

## DataFrames 

A DataFrame can be seen as a generalized two-dimensional array or as a specialized dictionary that maps column names to `Series`.

In [None]:
# A Series can be created directly from a dictionary

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

A DataFrame can be constructed as a two-dimensional object containing all this information:

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

The DataFrame now has an `index`, which identifies each row, and `columns`, an index that labels the columns of the table.

In [None]:
states.index

In [None]:
states.columns

In [None]:
# Use a column name to select a column (returned as a Series)
states['population']

In [None]:
# Use .loc with an index label to select a row
states.loc['California']

In [None]:
# You can also add new columns to the DataFrame after it has been created
states['density'] = states['population']/states['area']
states['largeDensity'] = states['density']>100 

In [None]:
states

## Not A Number or Missing Data

Often (very often) datasets contain Not A Number (NaN) or missing data. Pandas offers some elegant ways to deal with this situation that can save time and effort.

In [None]:
#create a DataFrame with missing data 

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df
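
Before dropping or filling anything, it can help to check where values are missing. The following sketch uses `isna()` on the same DataFrame; the names `missing_mask` and `missing_per_column` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Boolean DataFrame: True wherever a value is missing
missing_mask = df.isna()

# Summing the booleans counts the missing values in each column
missing_per_column = missing_mask.sum()

print(missing_mask)
print(missing_per_column)
```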

In [None]:
# By default, dropna() drops every row that contains at least one missing value
df.dropna()

In [None]:
# Alternatively, you can drop every column that contains at least one missing value
df.dropna(axis='columns')

In [None]:
# You can also keep only the rows or columns that have at least `thresh` non-missing values
# (here every column has at least 2, so nothing is dropped)
df.dropna(axis='columns', thresh=2)

In [None]:
# You can also fill in the missing data with different methods

# for example, with a number
df.fillna(0)

In [None]:
# or with the previous valid value in the same column (forward fill)
df.ffill()

In [None]:
# or with the next valid value in the same column (backward fill)
df.bfill()

In [None]:
# or with the mean of each column
df.fillna(df.mean(numeric_only=True))

In [None]:
# or with the mean of each row: transpose, fill column by column, then transpose back
df.T.fillna(df.mean(numeric_only=True, axis=1), axis=0).T
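
An alternative way to fill by row, shown here only as a sketch, is to apply a small function to each row with `apply(..., axis=1)`. The name `filled_by_row` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# For each row (axis=1), replace that row's missing values with the row's own mean
filled_by_row = df.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled_by_row)
```

This avoids the double transpose, at the cost of calling a Python function once per row, which may be slower on large tables.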

## Concat

DataFrames can be concatenated to create larger DataFrames.


In [None]:
# We will use this class to show several DataFrames side by side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

# We will use this function to create DataFrames for demonstrating different concepts
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)


In [None]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

By default, concatenation takes place row-wise (i.e., `axis=0`), but you can specify another axis.

In [None]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")

In the previous examples, we chose the index values manually, which is not very convenient.

We can ask pandas to ignore the existing index and build a new one:

In [None]:
df1 = make_df('AB', [0,1])
df2 = make_df('AB', [0,1])
display('df1', 'df2', 'pd.concat([df1, df2], ignore_index=True)')

Or we can add an outer index level with `keys`, which produces a hierarchical index (a `MultiIndex`):

In [None]:
df1 = make_df('AB', [0,1])
df2 = make_df('AB', [0,1])
display('df1', 'df2', "pd.concat([df1, df2], keys=['Group_1', 'Group_2'])")
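
Once the keys are in place, they can be used to pull the original pieces back out. The sketch below rebuilds two small DataFrames equivalent to those above (without the `make_df` helper, to stay self-contained) and selects from the result with `.loc`; the names `combined`, `group_2`, and `single_row` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])

combined = pd.concat([df1, df2], keys=["Group_1", "Group_2"])

# A key from the outer level selects all rows of that group
group_2 = combined.loc["Group_2"]

# A (key, original index) tuple selects a single row
single_row = combined.loc[("Group_1", 1)]

print(group_2)
print(single_row)
```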

You can also concatenate DataFrames whose columns only partly match. Pandas fills the missing entries with NaN:

In [None]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

or you can force pandas to merge the columns