# Pandas
## 1 Introduction
Pandas is one of the most widely used Python toolkits for data analysis and manipulation.
## 2 Series data structure
### 2.1 Creating Series
The Series data structure can be seen as a one-dimensional labeled array within the Pandas library. It is built on top of the NumPy array, which is what makes it fast.

In [2]:
import pandas as pd
import numpy as np
import math 

In [None]:
numbers = [2 ** 5, 2**6, 2 ** 7]
pd.Series(numbers) # pandas infers the type automatically

One desirable feature of Pandas is type inference. If a sequence-like object with numerical elements is passed to pd.Series(), any ***None*** entry is converted to a special float value, ***NaN*** (not a number).
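
As a minimal sketch of this conversion (the variable names `numeric` and `words` are chosen only for illustration), the following builds one numerical and one non-numerical Series that each contain a ***None*** entry:

```python
import pandas as pd

# Numerical data: None becomes NaN and the Series is stored as float64.
numeric = pd.Series([1, 2, None])
print(numeric.dtype)            # float64
print(numeric.isna().tolist())  # [False, False, True]

# Non-numerical data: the missing entry is still reported by isna().
words = pd.Series(["a", "b", None])
print(words.isna().tolist())    # [False, False, True]
```

Note that the integers 1 and 2 are stored as floats, because an integer NumPy array has no way to represent a missing value.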

It is important to keep in mind that the value ***NaN*** is not, by any means, equivalent to the value ***None***.

In [None]:
print(np.nan == None)   # False
print(np.nan == np.nan) # False: NaN is not even equal to itself

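The value ***NaN*** is an ordinary IEEE 754 floating-point value, which lets numerical data stay in an efficient float array even when entries are missing. Because it never compares equal to anything, itself included, it has to be tested with the np.isnan() function.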

In [None]:
print(np.isnan(np.nan))
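
The following sketch compares a few ways of testing for a missing value. It assumes only the standard `math` module, NumPy and pandas; `pd.isna` is shown because, unlike `np.isnan`, it also accepts ***None***.

```python
import math
import numpy as np
import pandas as pd

value = np.nan
print(value != value)      # True: NaN never equals itself
print(math.isnan(value))   # True
print(pd.isna(value))      # True
print(pd.isna(None))       # True: pd.isna also accepts None

try:
    np.isnan(None)         # np.isnan only accepts numerical input
except TypeError:
    print("np.isnan cannot handle None")
```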

Going back to the Pandas Series, it behaves like a mixture of a list and a dictionary: the index labels can be either numbers or other values, such as strings.

In [None]:
s1 = pd.Series([5,6,7,8,9])
s2 = pd.Series({"0": 1, "1": 2, "2": 3, "3": 4})

print(s1)
# print("#########################################################################################")
print(s2)
print(s2["1"])
print(s2.index)

In [None]:
s3_index = ["sxx", "swz", "mmu"]
s3 = pd.Series([3,4,5], index=s3_index) # the number of values and the length of the passed index must match
print(s3)

In [None]:
scores = {"A+": 95, "A": 90, "B": 75, "C": 60, "D": 50}
# the index keyword argument takes precedence over the dictionary keys:
# keys missing from the index are dropped, and index labels without a key get NaN
s = pd.Series(scores, index=['A', "B+", 'B', 'C', 'D', 'F'])
print(s)
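
To make the effect of the `index` argument explicit, here is a small sketch that inspects the resulting Series (the name `labels` is introduced only for this example):

```python
import pandas as pd

scores = {"A+": 95, "A": 90, "B": 75, "C": 60, "D": 50}
labels = ["A", "B+", "B", "C", "D", "F"]
s = pd.Series(scores, index=labels)

print("A+" in s.index)             # False: the key is not in the requested index
print(s[s.isna()].index.tolist())  # ['B+', 'F']: labels without a value
print(s.dtype)                     # float64, because NaN is a float
```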

### 2.2 Querying Pandas Series
Creating a Series is only one step in the data analysis process. Querying it is where the real work happens.

In [None]:
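# iloc selects by position, loc selects by label
print(s.iloc[3] == s.loc['C'] == s['C'])
# plain brackets are ambiguous: s['C'] is a label lookup, while s[3] relied on a positional
# fallback that recent pandas versions deprecate, so the safest option is to use loc and iloc
s_exp = pd.Series({99:"oh", 100:"shit", 101:"damn"})
try:
    s_exp[0] # with an integer index, 0 is treated as a label, and no such label exists
except KeyError:
    print("You should use the iloc attribute. Here give it a try:")
    print(s_exp.iloc[0])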
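
The difference between the two accessors can be sketched on a Series with integer labels, where the ambiguity is greatest. The Series `s_int` below is a hypothetical example, not part of the notebook data.

```python
import pandas as pd

s_int = pd.Series(["first", "second", "third"], index=[10, 20, 30])

print(s_int.loc[10])    # label-based: 'first'
print(s_int.iloc[0])    # position-based: 'first'
print(s_int.iloc[-1])   # last position: 'third'

# Label slices include the end label, positional slices exclude the end position.
print(s_int.loc[10:20].tolist())   # ['first', 'second']
print(s_int.iloc[0:1].tolist())    # ['first']
```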

It is important to keep in mind that the built-in methods are vectorized: the loop over the elements runs in compiled code on the underlying NumPy array instead of in the Python interpreter, which leads to dramatic speed-ups.

In [None]:
series = pd.Series(np.random.randint(0,5000, 1000)) # created in its own cell: names assigned inside a %%timeit cell are not kept

In [None]:
%%timeit -n 100
# %%timeit is an IPython cell magic: it must be the first line of the cell; it runs the cell many times and reports the average execution time
## The slow approach
total = 0
for val in series:
    total += val
total / len(series)


In [None]:
%%timeit -n 100
# the vectorized, fast approach
series.mean()
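
Outside a notebook, the same comparison can be sketched with the standard `timeit` module. The helper `loop_mean` is a hypothetical name introduced here, and the measured times depend on the machine, so no particular numbers should be expected.

```python
import timeit
import numpy as np
import pandas as pd

series = pd.Series(np.random.randint(0, 5000, 1000))

def loop_mean(values):
    """Average computed with an explicit Python loop."""
    total = 0
    for val in values:
        total += val
    return total / len(values)

# Time 100 runs of each approach.
loop_time = timeit.timeit(lambda: loop_mean(series), number=100)
vectorized_time = timeit.timeit(lambda: series.mean(), number=100)

print(f"loop:       {loop_time:.4f} s for 100 runs")
print(f"vectorized: {vectorized_time:.4f} s for 100 runs")
```

Both approaches return the same average; only the time needed to compute it differs.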

In [None]:
## following the same principle, a Series can be manipulated as if it were a NumPy array, which makes use of vectorization
n = np.random.randint(-500, 500)
print(n)
print(series.head(10)) # print the first 10 elements
series_plus_n = series + n 
print(series_plus_n.head(10))
series_plus_n -= n

In [None]:
# the other approach would be to iterate and set the values manually: the slow approach
# (Series.iteritems() was removed in pandas 2.0; items() is its replacement)
for label, value in series_plus_n.items():
    series_plus_n.loc[label] = value + n
print(series_plus_n.head(10))

## 3 DataFrame
pd.DataFrame is the core data structure of the Pandas library.
### 3.1 Initialization

In [3]:
## out of series
s1 = pd.Series({"Name": 'a', "Class": "Phy", "Score": 85})
s2 = pd.Series({"Name": 'b', "Class": "Chem", "Score": 90})
s3 = pd.Series({"Name": 'c', "Class": "Phy", "Score": 88})
df1 = pd.DataFrame([s1, s2, s3], index=['s1', 's2', 's3'])
print(df1)
# a pandas DataFrame can be built out of pandas Series, each one constituting a row

## directly from a list of dictionaries
print("\nHERE IS AN IDENTICAL DATAFRAME\n")
df2 = pd.DataFrame([{"Name": 'a', "Class": "Phy", "Score": 85}, 
{"Name": 'b', "Class": "Chem", "Score": 90}, 
{"Name": 'c', "Class": "Phy", "Score": 88}], index=['s1', 's2', 's3'])
print(df2)

   Name Class  Score
s1    a   Phy     85
s2    b  Chem     90
s3    c   Phy     88

HERE IS AN IDENTICAL DATAFRAME

   Name Class  Score
s1    a   Phy     85
s2    b  Chem     90
s3    c   Phy     88
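
A third common way to build the same table, sketched here as an addition to the two approaches above, is a dictionary that maps each column name to its list of values. The names `df_from_columns` and `df_from_rows` are illustrative only.

```python
import pandas as pd

# Each dictionary key becomes a column; the lists hold the column values.
df_from_columns = pd.DataFrame(
    {
        "Name": ["a", "b", "c"],
        "Class": ["Phy", "Chem", "Phy"],
        "Score": [85, 90, 88],
    },
    index=["s1", "s2", "s3"],
)

# The row-oriented construction used earlier, for comparison.
df_from_rows = pd.DataFrame(
    [
        {"Name": "a", "Class": "Phy", "Score": 85},
        {"Name": "b", "Class": "Chem", "Score": 90},
        {"Name": "c", "Class": "Phy", "Score": 88},
    ],
    index=["s1", "s2", "s3"],
)

print(df_from_columns)
print(df_from_columns.values.tolist() == df_from_rows.values.tolist())
```

The column-oriented form is usually more convenient when the data already arrives as one list per field.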


### 3.2 Accessing and dropping columns and rows

In [None]:
## With the Series initialization approach in mind, we can see how the loc and iloc operators work.
s4 = df1.loc['s1'] # returns the row labeled 's1' as a Series
print(s4)
try:
    df1.loc['Name']
except KeyError:
    print("With a single argument, loc looks up row labels, so it cannot select a column on its own")

In [None]:
## to select based on columns:
name_col = df1['Name'] # pd.Series object
print(type(name_col))
name_col_df = df1[['Name']] # pd.DataFrame object
print(type(name_col_df))
mul_col = df1[['Name', 'Class']] # selecting with a list of columns always returns a DataFrame object
print(mul_col)

In [None]:
# the best approach is to use loc and iloc
print(df1.loc[['s1', 's2'], ['Name', 'Score']], "\n") # the names and scores of s1 and s2
print(df1.loc[:, 'Name'], type(df1.loc[:, 'Name']), sep="\n") # the names of all rows
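
The cell above only uses loc. As a complementary sketch, the position-based iloc works the same way with integer positions; the small frame `df` below is rebuilt so that the example is self-contained.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["a", "b", "c"], "Score": [85, 90, 88]},
    index=["s1", "s2", "s3"],
)

print(df.iloc[0])             # first row as a Series
print(df.iloc[:, 0])          # first column as a Series
print(df.iloc[0:2, [1]])      # first two rows, second column, as a DataFrame
print(df.loc["s2", "Score"])  # a single cell: 90
```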

In [None]:
# Deleting can be tricky: drop returns a new DataFrame unless inplace=True is passed
df_copy = df1.copy() # making a copy

df_copy.drop("s1", inplace=True, axis=0) # axis=0 refers to the rows, axis=1 to the columns
print(df_copy)
# adding a column is simple
df_copy['another_col'] = [1,2] # the length of the column should match the length of the index attribute
print(df_copy)
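
As a sketch of the non-in-place alternative, `drop` also accepts the `index` and `columns` keywords, which avoid having to remember the meaning of `axis`. The frame below is rebuilt for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["a", "b", "c"], "Class": ["Phy", "Chem", "Phy"], "Score": [85, 90, 88]},
    index=["s1", "s2", "s3"],
)

# drop returns a new DataFrame by default and leaves the original untouched.
without_row = df.drop(index="s1")
without_col = df.drop(columns=["Class"])

print(without_row.index.tolist())    # ['s2', 's3']
print(without_col.columns.tolist())  # ['Name', 'Score']
print(df.shape)                      # (3, 3)
```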

### 3.3 Loading and Indexing data

In [None]:
### loading data
df = pd.read_csv("utility_files/titatic_comp_train.csv", index_col=0) # use the first column (the id) as the index
print(df.head())
# some column names may not be clear, so renaming them might prove necessary
print("\n", "\n", sep="After name modification")
new_df = df.rename(columns={"PassengerId": "", "SibSp": "num_siblings_spouses", "parch": "num_parent_child"})
print(new_df.head())


In [None]:
# we can see that a name changes only if the dictionary key matches the existing column name exactly
# a mismatch (such as "parch") is silently ignored, which can easily lead to bugs
# a better approach is to apply a function to each column name: stripping white space, converting to lower or upper case,
# or even capitalizing

new_df  = new_df.rename(mapper=str.strip, axis=1)
print(new_df.head())
# rename accepts only one function per call, so applying several transformations this way requires several calls
print("\n", "\n", sep="The better approach")
mapper = {"Sibsp": "Num_siblings_spouses", "Parch": "Num_parent_child"}
new_cols = [col.strip().lower().capitalize() for col in new_df.columns]
new_cols = [mapper[col] if col in mapper else col for col in new_cols]
new_df.columns = new_cols
print(new_df.head(10))
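
Another option, sketched below under the assumption of a small hypothetical frame `messy` with untidy column names, is to pass `rename` a single function that combines several transformations and then chain a second `rename` with the dictionary. The helper name `clean` is introduced only for this example.

```python
import pandas as pd

# Hypothetical frame with messy column names, used only for illustration.
messy = pd.DataFrame({" SibSp ": [1, 0], "PARCH": [0, 2], "pclass": [3, 1]})

def clean(name):
    """Strip white space, then upper-case the first letter and lower-case the rest."""
    return name.strip().capitalize()

mapper = {"Sibsp": "Num_siblings_spouses", "Parch": "Num_parent_child"}
tidy = messy.rename(columns=clean).rename(columns=mapper)

print(tidy.columns.tolist())
# ['Num_siblings_spouses', 'Num_parent_child', 'Pclass']
```

Both calls return new DataFrames, so `messy` itself keeps its original column names.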

### 3.4 Querying data 
Among the most important data manipulation techniques is boolean masking, which applies both to Pandas data structures and to NumPy arrays.

In [None]:
# choose only the passengers from the 2nd class or above
non_poor_mask = new_df['Pclass'] <= 2
# print(non_poor_mask)
non_poor_df = new_df.where(non_poor_mask)
print(non_poor_df.loc[:,["Pclass"]]) 
# the rows that do not meet the condition set by the boolean mask are not dropped from the table.
# instead, every value in those rows is set to NaN
# dropping the rows that are entirely NaN finishes the job; a plain dropna() would also
# discard the matching rows that already had a missing value in some column
non_poor_df = new_df.where(non_poor_mask).dropna(how="all")
print(non_poor_df.loc[:, ["Pclass"]])

In [None]:
# More syntactic sugar:
non_poor_df_2 = new_df[new_df['Pclass'] <= 2]
print(non_poor_df_2.loc[:, ["Pclass"]])
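
The difference between `where` and direct boolean indexing can be sketched on a tiny hypothetical frame (the `Fare` values are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2], "Fare": [71.3, 7.9, 13.0]})
mask = df["Pclass"] <= 2

masked = df.where(mask)   # same shape, rejected rows become NaN
filtered = df[mask]       # rejected rows are removed

print(masked.shape)              # (3, 2)
print(filtered.index.tolist())   # [0, 2]
print(masked["Pclass"].dtype)    # float64: the NaN forces an upcast
print(filtered["Pclass"].dtype)  # int64
```

Boolean indexing keeps the original dtypes, which is one more reason to prefer it when the rejected rows are not needed.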

The previous examples considered only a single filtering condition. This scenario is quite unlikely in practice.

In [None]:
non_poor_with_relatives_df = new_df[(new_df['Pclass'] <= 2) & (new_df['Num_siblings_spouses'] > 0)] 
# with more than one condition, each condition is wrapped in parentheses and the conditions are combined with & (and) or | (or)
print(non_poor_with_relatives_df.loc[:, ['Pclass', 'Num_siblings_spouses']])
# it might be a good idea to build an independent boolean mask for complicated conditions, then pass it directly to the DataFrame in question
rich_with_no_relatives_mask = (new_df["Pclass"] == 1) & (new_df['Num_siblings_spouses'] == 0) # the parentheses are needed because & binds more tightly than ==
print(new_df[rich_with_no_relatives_mask].loc[:, ["Pclass", 'Num_siblings_spouses']])
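
Besides `&`, masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a hypothetical four-row frame and the `isin` method to show these operators on named masks.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Pclass": [1, 3, 2, 3],
        "Num_siblings_spouses": [0, 1, 0, 0],
    }
)

alone = df["Num_siblings_spouses"] == 0
first_or_second = df["Pclass"].isin([1, 2])

# First or second class, or travelling with a sibling or spouse.
print(df[first_or_second | ~alone].index.tolist())   # [0, 1, 2]
# Third class and travelling alone.
print(df[~first_or_second & alone].index.tolist())   # [3]
```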


### 3.5 Indexing

In [None]:
# we can set the index of the DataFrame either at load time, through the index_col parameter of read_csv,
# or later, using methods such as set_index and reset_index

df = pd.read_csv("utility_files/titatic_comp_train.csv") 
# print(df.head())
df2 = df.set_index("PassengerId")
print(df2.head(), "\n", sep="\n")

df3 = df2.reset_index() # the index is again the default integer range starting from 0
print(df3.head())


In [None]:
# we can have composite indices as follows:
df4 = df.set_index(['Pclass', 'PassengerId'])
print(df4)
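
Rows of a composite index are selected with loc, using either the first level alone or a tuple covering all levels. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Pclass": [1, 1, 3],
        "PassengerId": [10, 11, 12],
        "Fare": [71.3, 53.1, 7.9],
    }
)
indexed = df.set_index(["Pclass", "PassengerId"])

print(indexed.loc[1])                 # all rows whose first level is 1
print(indexed.loc[(1, 11), "Fare"])   # one row of the composite index: 53.1
print(list(indexed.index.names))      # ['Pclass', 'PassengerId']
```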

### 3.6 Missing Data
pandas.DataFrame comes with convenient methods for handling missing data.

In [None]:
# to make sure there are missing values to work with, let's set some randomly chosen cells to None
n_rows = len(df.index)
n_cols = len(df.columns)

for _ in range(400):
    df.iloc[np.random.randint(0, n_rows), np.random.randint(0, n_cols)] = None
# we can create a mask for missing values as follows:
is_null_mask = df.isnull()
# print(is_null_mask.head(100))

no_na_df = df.dropna()
# filling the missing values instead of removing them
df_ffill = df.ffill() # forward fill: each missing value takes the last valid value above it (fillna(method='ffill') is deprecated)
print(df_ffill) 
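
The filling strategies are easiest to compare on a short Series. The sketch below contrasts forward fill, backward fill and filling with a constant; the values are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())    # [1.0, 1.0, 1.0, 4.0]: previous valid value
print(s.bfill().tolist())    # [1.0, 4.0, 4.0, 4.0]: next valid value
print(s.fillna(0).tolist())  # [1.0, 0.0, 0.0, 4.0]: a constant
```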

For the ***DataFrame.fillna()*** method, it is better to refer to the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).

This [link](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) is for the ***DataFrame.replace()*** method.
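
As a brief sketch of what `replace` does, the example below recodes the values of one column with a nested dictionary and turns a sentinel value into NaN. The frame and the convention that -1 marks an unknown age are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [22, -1, 35]})

# Nested dictionary: {column: {old value: new value}}.
recoded = df.replace({"Sex": {"male": "M", "female": "F"}})
print(recoded["Sex"].tolist())   # ['M', 'F', 'M']

# Hypothetical convention: -1 marks an unknown age.
age_clean = df["Age"].replace(-1, np.nan)
print(age_clean.isna().tolist())  # [False, True, False]
```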

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj1 = pd.Series(sdata)
print(obj1)
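states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states) # 'California' has no entry in sdata and becomes NaN; 'Utah' is dropped
print(obj2)
obj3 = pd.isnull(obj2) # boolean mask of the missing values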


In [None]:

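x = obj2['California']
print(obj2['California'] != x)    # True: NaN is never equal to itself

print(obj2['California'] == None) # False: NaN is not None
print(obj3['California'])         # True: pd.isnull detected the missing value
math.isnan(obj2['California'])    # True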

In [None]:
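df_np = pd.DataFrame(np.array([[1,2],[2,3]])) # from a 2-D NumPy array: one row per inner array
df_np = pd.DataFrame(pd.Series([1,2,3])) # from a Series: a single column (this overwrites the previous frame)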
print(df_np)

In [None]:
print(df1)

In [None]:
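df1 = df1.drop("Name", axis=1) # df1 keeps only the Class and Score columns

# each Series needs the column names as its index; otherwise the columns argument
# finds no matching labels and the resulting DataFrame is filled with NaN
cols = ["Name", "Class", "Score"]
s4 = pd.Series(['d', 'Shit', 120], index=cols)
s5 = pd.Series(['e', 'Shit2', 106], index=cols)
s6 = pd.Series(['f', 'Shit3', 114], index=cols)
s7 = pd.Series(['g', 'Shit4', 112], index=cols)
df2 = pd.DataFrame([s4, s5, s6, s7])
print(df2)
df1 = pd.concat([df1, df2],ignore_index=True, axis=0) # the rows coming from df1 get NaN in the Name column
print(df1)
print(df1[df1['Score'].gt(105) & df1['Score'].lt(115)]) # scores strictly between 105 and 115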
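
The same range filter can be sketched more compactly with `Series.between`, whose `inclusive` argument controls whether the bounds are part of the range. The scores below are made up for the example.

```python
import pandas as pd

scores = pd.Series([85, 106, 115, 112, 120])

# Equivalent to scores.gt(105) & scores.lt(115).
strict = scores.between(105, 115, inclusive="neither")
print(scores[strict].tolist())                    # [106, 112]

# By default both bounds are included.
print(scores[scores.between(105, 115)].tolist())  # [106, 115, 112]
```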