# Data Analysis

The main goal of this class is to introduce the data analysis library __pandas__, which adds more flexibility and options than NumPy, although at the cost of some performance and efficiency. Knowing the basics of pandas lets you handle anything from a few records in an _Excel_ file to thousands or millions of records from a database.


In [1]:
import numpy as np
import pandas as pd

from pathlib import Path

In [4]:
data_filepath = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# data_filepath = Path().resolve().parent / "data" / "breast-cancer-wisconsin.data"
data_filepath

'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'

In [12]:
breast_cancer_data = pd.read_csv(
    data_filepath,
    names=[
        "code",
        "clump_thickness",
        "uniformity_cell_size",
        "uniformity_cell_shape",
        "marginal_adhesion",
        "single_epithelial_cell_size",
        "bare_nuclei",
        "bland_chromatin",
        "normal_cucleoli",
        "mitoses",
        "class",
    ],
    index_col=0
)
breast_cancer_data.head()

| code | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_cucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |


## Series

Series are one-dimensional labeled arrays. You can think of them as similar to the columns of an Excel spreadsheet.

There are multiple ways to create a `pd.Series`: from lists, dictionaries, `np.array` objects, or a file.
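As a minimal sketch of the in-memory options listed above, the following builds a `pd.Series` from a list, a dictionary and a NumPy array. The values and the index labels (`101`, `102`, `103`) are made up for illustration and do not come from the breast cancer dataset.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...).
from_list = pd.Series([5, 3, 8])

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 5, "b": 3, "c": 8})

# From a NumPy array, with an explicit index of hypothetical patient codes.
from_array = pd.Series(
    np.array([5, 3, 8]),
    index=[101, 102, 103],
    name="clump_thickness",
)

print(from_list)
print(from_dict)
print(from_array)
```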

Since we already loaded the breast cancer data, we will use it as an example. Each column of this file has been converted to a `pd.Series`.

In [None]:
clump_thick_series = breast_cancer_data["clump_thickness"]
clump_thick_series.head()

In [None]:
type(clump_thick_series)

A `pd.Series` is made up of an _index_ and _values_.

In [None]:
clump_thick_series.index

In [None]:
clump_thick_series.values

Now, imagine you want to access the value for the third patient.

In [None]:
clump_thick_series.iloc[2]  # Remember that Python is 0-indexed, so position 2 is the third element.

However, what if you want to know the clump thickness of a specific patient? Since we have their codes, we can access it with another method.

For example, for the patient with code `1166654`:

In [None]:
clump_thick_series.loc[1166654]

Don't forget:

* `loc` refers to indexes (__labels__).
* `iloc` refers to integer position (order).
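The difference between the two accessors is easiest to see when the index is not sorted. The sketch below uses a toy series with made-up labels, so the names and values are assumptions rather than real patient data.

```python
import pandas as pd

# Toy series whose labels are deliberately out of order.
toy = pd.Series([5, 3, 8], index=[103, 101, 102])

by_label = toy.loc[101]    # looks up the label 101, wherever it sits
by_position = toy.iloc[0]  # takes the first element, whatever its label is

print(by_label)     # 3
print(by_position)  # 5
```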

We will focus on `loc` instead of `iloc`, since the power of `pandas` comes from its indexes, which can be numeric or categorical. If you only need position-based analysis, `pandas` could be overkill and `numpy` could be enough.

What if you want to get the values of several patients? For example, patients `1166654` and `1178580`.

In [None]:
clump_thick_series.loc[[1166654, 1178580]]

```{important}
Notice that if the argument is a single label (and that label appears only once in the index), `loc` returns only the value. On the other hand, if the argument is a list, `loc` returns a `pd.Series` object.
```

In [None]:
type(clump_thick_series.loc[1166654])

In [None]:
type(clump_thick_series.loc[[1166654, 1178580]])

You can even edit or add values with these methods.

For instance, what if the dataset is wrong about patient `1166654` and the clump thickness should have been `4` instead of `5`?

We can fix that easily.

In [None]:
clump_thick_series.loc[1166654] = 4

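You may get a `SettingWithCopyWarning` message after running the last code cell. I suggest reading the link cited in that warning. In simple words, `clump_thick_series` was selected from `breast_cancer_data`, so it may be a __view__ of the DataFrame rather than an independent copy; if it is a view, changing it also changes the main object itself. Whether you get a view or a copy depends on the pandas version and its settings (with Copy-on-Write enabled, the original DataFrame is never modified this way). This is a feature, not an error, but we have to be careful with it in the future.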
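One way to sidestep the view-versus-copy question is to take an explicit copy before editing. This is only a sketch on a toy DataFrame with hypothetical labels, not the approach used in the rest of this class.

```python
import pandas as pd

# Toy DataFrame with made-up patient codes as index.
df = pd.DataFrame({"clump_thickness": [5, 3, 8]}, index=[101, 102, 103])

# .copy() guarantees an independent Series, so edits stay local to it.
independent = df["clump_thickness"].copy()
independent.loc[101] = 4

print(independent.loc[101])            # 4
print(df.loc[101, "clump_thickness"])  # 5, the DataFrame is untouched
```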

OK, let's check the change we made.

In [None]:
clump_thick_series.loc[1166654]

Another common operation is filtering by a condition, using a boolean mask.

For example, let's get all the patients with a clump thickness greater than 7.

In [None]:
clump_thick_series > 7

You can make logical comparisons with a `pd.Series`, but this only returns another `pd.Series` of boolean values (`True`/`False`). We want to keep only the entries where the value is `True`.

In [None]:
clump_thick_series.loc[clump_thick_series > 7]

You can avoid using `loc` for this task, but to be honest I'd rather use it.

In [None]:
clump_thick_series[clump_thick_series > 7]

However, my favorite version uses a functional approach with a `lambda` function. It is less intuitive at first, but it allows you to chain operations.

In [None]:
clump_thick_series.loc[lambda x: x > 7]
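To illustrate why the `lambda` form helps when chaining, here is a small sketch on a toy series with made-up labels and values. Each `lambda` receives the result of the previous step, so no intermediate variable is needed.

```python
import pandas as pd

# Toy series with hypothetical patient codes as index.
toy = pd.Series([5, 9, 8, 10, 3], index=[101, 102, 103, 104, 105])

result = (
    toy
    .loc[lambda x: x > 7]   # keep values greater than 7
    .sort_values()          # sort what is left in ascending order
    .loc[lambda x: x < 10]  # filter again, on the already filtered series
)

print(result)
```

With a plain boolean mask, the second filter would have to reference a named intermediate series, which is what the `lambda` avoids.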

## DataFrames

A `pd.DataFrame` is a 2-dimensional labeled array with row labels (the _index_) and column labels (the _columns_). It is the natural extension of `pd.Series`, and you can even think of it as multiple `pd.Series` concatenated side by side.
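The idea of a DataFrame as several Series sharing one index can be sketched as follows. The two Series and their values are invented for the example; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Two toy Series that share the same (hypothetical) index of patient codes.
index = [101, 102, 103]
clump = pd.Series([5, 3, 8], index=index, name="clump_thickness")
klass = pd.Series([2, 2, 4], index=index, name="class")

# Concatenating along axis=1 places each Series as a column.
toy_df = pd.concat([clump, klass], axis=1)

# Selecting a single column gives back a Series.
column = toy_df["class"]

print(toy_df)
print(type(column))
```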

In [None]:
type(breast_cancer_data)

There are a few useful methods for exploring the data. Let's try some of them.

In [None]:
breast_cancer_data.head()

In [None]:
breast_cancer_data.tail()

In [None]:
breast_cancer_data.shape

In [None]:
breast_cancer_data.info()

In [None]:
breast_cancer_data.dtypes

In [None]:
breast_cancer_data.describe()

In [None]:
breast_cancer_data.describe(include="all")

In [None]:
breast_cancer_data.max()

In [None]:
breast_cancer_data.mean()

In [None]:
breast_cancer_data.loc[1166654]

In [None]:
breast_cancer_data.loc[1166654].loc["clump_thickness"]

In [None]:
breast_cancer_data.loc[1166654, "clump_thickness"]

In [None]:
breast_cancer_data.loc[1166654, ["clump_thickness", "class"]]

In [None]:
breast_cancer_data.loc[[1166654, 1178580], ["clump_thickness", "class"]]

In [None]:
breast_cancer_data.loc[lambda x: x["clump_thickness"] > 7]

In [None]:
breast_cancer_data.loc[:, "bare_nuclei"]

In [None]:
breast_cancer_data.loc[:, "bare_nuclei"].value_counts()

In [None]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?']

In [None]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == '?',  'bare_nuclei'] = pd.NA

In [None]:
# Comparing with `==` does not match missing values; use isnull() as in the next cell.
breast_cancer_data.loc[lambda s: s['bare_nuclei'] == pd.NA]

In [None]:
breast_cancer_data.loc[lambda s: s['bare_nuclei'].isnull()]
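The two cells above behave differently because `pd.NA` never compares as equal to anything, including itself. The sketch below shows this on a toy series (made-up values) and uses `isnull()` to build the mask instead.

```python
import pandas as pd

# Toy object-dtype series with one missing entry.
toy = pd.Series(["1", pd.NA, "3"], dtype=object)

# An equality test against pd.NA yields pd.NA, not True.
print(pd.NA == pd.NA)  # <NA>

# isnull() is the reliable way to locate missing values.
na_mask = toy.isnull()
print(toy.loc[na_mask])
```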

In [None]:
breast_cancer_data.isnull()

In [None]:
breast_cancer_data.isnull().any()

In [None]:
breast_cancer_data.isnull().any(axis=1)

In [None]:
breast_cancer_data.fillna?

In [None]:
pd.to_numeric(breast_cancer_data['bare_nuclei'])

In [None]:
breast_cancer_data['bare_nuclei'] = pd.to_numeric(breast_cancer_data['bare_nuclei'])

In [None]:
breast_cancer_data['bare_nuclei'].isnull().any()

In [None]:
breast_cancer_data['bare_nuclei'].mean()

In [None]:
import numpy as np

In [None]:
bare_nuclei_mean = np.ceil(breast_cancer_data['bare_nuclei'].mean())
bare_nuclei_mean

In [None]:
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})

In [None]:
breast_cancer_data.isnull().any()

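What? The missing values are still there because `fillna` returns a new DataFrame by default and leaves the original unchanged. To keep the result, either pass `inplace=True` or assign the returned DataFrame back to the variable.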

In [None]:
breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean}, inplace=True)

In [None]:
breast_cancer_data.isnull().any()

In [None]:
breast_cancer_data = breast_cancer_data.fillna(value={'bare_nuclei': bare_nuclei_mean})
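The behavior of `fillna` in the cells above can be reproduced on a toy DataFrame. This is a sketch with made-up values; the fill value `4.0` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing value.
toy_df = pd.DataFrame({"bare_nuclei": [1.0, np.nan, 10.0]})

# Without reassignment the result is a new object and toy_df keeps its NaN.
filled = toy_df.fillna(value={"bare_nuclei": 4.0})
still_missing = bool(toy_df["bare_nuclei"].isnull().any())

# Reassigning the returned DataFrame keeps the filled values.
toy_df = toy_df.fillna(value={"bare_nuclei": 4.0})
now_missing = bool(toy_df["bare_nuclei"].isnull().any())

print(still_missing)  # True
print(now_missing)    # False
```

Reassigning keeps the operation explicit and allows it to be chained with other methods, which is why it is often preferred over `inplace=True`.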

### Summary

blablabl