# Fundamentals of Data Science Libraries for Python

# 1. Libraries

A library is a block of reusable code that can be imported and used to build on work that has already been done.

The most widely used libraries in data science:

+ **Data manipulation:**
    + NumPy
    + SciPy
    + Pandas

+ **Traditional machine learning models:**
    + scikit-learn

+ **Deep learning models:**
    + TensorFlow - Keras
    + PyTorch

+ **Visualization:**
    + Matplotlib
    + Seaborn
    + Plotly

Recall that everything in Python is an object (strings, integers, lists, etc.). When we use a library, we can create other kinds of objects that are not built into Python.

These objects are made up of methods and attributes.
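As a minimal sketch of the difference between a method and an attribute, the snippet below uses a built-in list and a NumPy array. The variable names `ages` and `arr` are illustrative and do not come from the rest of the notebook.

```python
import numpy as np

# A built-in object: a Python list.
ages = [22, 35, 12]
ages.append(18)        # method: called with parentheses, performs an action

# An object provided by a library: a NumPy array.
arr = np.array(ages)
print(arr.shape)       # attribute: read without parentheses
print(arr.max())       # method: called with parentheses
```

An attribute stores information about the object, while a method is a function attached to the object that computes or changes something.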

![image.png](attachment:image.png)

In [1]:
import pandas as pd
import numpy as np

---

# 2. Numpy

NumPy is an open-source library, standardized and optimized for working with numerical data, that makes vector and matrix operations easy in Python.

Creating ndarray objects (N-dimensional arrays):

In [29]:
lista_de_edades = [22, 35, 12, 18, 23, 22, 22]
lista_de_edades

[22, 35, 12, 18, 23, 22, 22]

In [41]:
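np_array_edades = np.array(lista_de_edades)
# Slice with a step of 2: every other element, starting at index 0
print(np_array_edades[::2])
# A built-in Python set, which can mix element types
a = {1,3,"ds"}
a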

[22 12 23 22]


{1, 3, 'ds'}

In [4]:
a = np.array([1, 2, 3, 4, 5, 6])
a

array([1, 2, 3, 4, 5, 6])

![image.png](attachment:image.png)

In [5]:
np_array_1d = np.array([7,2,9,10])
np_array_1d

array([ 7,  2,  9, 10])

In [44]:
np_array_2d = np.array([ 
                        [1, 2, 3], 
                        [4, 5, 6]
                       ])
print(np_array_2d.shape)

(2, 3)


In [45]:
np_array_3d = np.array([ 
                        [[1, 2, 3], [4, 5, 6], [4, 5, 7]], 
                        [[7, 8, 9], [1, 4, 2], [4, 5, 6]],
                        [[7, 8, 7], [1, 3, 1], [4, 5, 6]]
                       
                       ])
print(np_array_3d.shape)

(3, 3, 3)


Differences from Python lists: a NumPy array is much faster and more compact.
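The following sketch illustrates one practical side of that difference: element-wise arithmetic. It reuses the ages from the cells above under new, illustrative names (`ages_list`, `ages_array`), and it does not measure speed.

```python
import numpy as np

ages_list = [22, 35, 12, 18, 23, 22, 22]
ages_array = np.array(ages_list)

# With a list, an element-wise operation needs an explicit loop or comprehension.
doubled_list = [age * 2 for age in ages_list]

# With an array, the same operation is a single vectorized expression.
doubled_array = ages_array * 2

print(doubled_list)
print(doubled_array)

# For a list, `*` means repetition, not multiplication of the values.
print(len(ages_list * 2))

# An array stores its elements in one fixed-type buffer:
# total bytes = bytes per element * number of elements.
print(ages_array.itemsize, ages_array.nbytes)
```

The exact `itemsize` depends on the platform's default integer type, so no specific byte count is assumed here.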

In [8]:
# Arrays are mutable: assigning to an index replaces that element in place
np_array_edades[1] = 22
np_array_edades

array([22, 22, 12, 18, 23, 22, 22])

---

**Some attributes:**

In [20]:
np_array_edades.size

7

In [12]:
np_array_edades.dtype

dtype('int64')

In [15]:
np_array_3d.shape

(3, 3, 3)

In [16]:
np_array_3d.ndim

3

In [26]:
np_array_2d.shape

(2, 3)

In [25]:
np_array_2d.T

array([[1, 4],
       [2, 5],
       [3, 6]])

---

**Some functions and methods:**

In [21]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [22]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [23]:
np.arange(100)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [24]:
np.arange(100,200)

array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
       113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
       139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
       152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
       165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
       178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
       191, 192, 193, 194, 195, 196, 197, 198, 199])

In [19]:
np.arange(10,100,5)

array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90,
       95])

In [20]:
np.linspace(0, 4, num=10)

array([0.        , 0.44444444, 0.88888889, 1.33333333, 1.77777778,
       2.22222222, 2.66666667, 3.11111111, 3.55555556, 4.        ])

In [21]:
np_array_2d.reshape(6)

array([1, 2, 3, 4, 5, 6])

In [22]:
np_array_edades.max()

23

In [23]:
np_array_edades.min()

12

In [24]:
np_array_edades.sum()

141

In [25]:
np_array_edades.mean()

20.142857142857142

In [26]:
np_array_edades

array([22, 22, 12, 18, 23, 22, 22])

In [27]:
np_unique_values = np.unique(np_array_edades)

In [28]:
print(np.arange(50))

print(np.flip(np.arange(50)))

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49]
[49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2
  1  0]


---

**Some operations:**

[Documentation](https://numpy.org/doc/stable/reference/routines.math.html)

In [29]:
np_array_edades * 10

array([220, 220, 120, 180, 230, 220, 220])

In [30]:
np_array_edades + 20

array([42, 42, 32, 38, 43, 42, 42])

In [31]:
np.square(np_array_edades)

array([484, 484, 144, 324, 529, 484, 484], dtype=int32)

![image.png](attachment:image.png)

In [32]:
y1 = np.array([1,2,3])
print(y1)

y2 = np.array([1,1,1])
print(y2)

(np.sum(np.square(y1 - y2))) / y1.size

[1 2 3]
[1 1 1]


1.6666666666666667
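The expression in the cell above has the form of a mean squared error: the squared differences between two arrays, summed and divided by the number of elements. Assuming that reading is the intended one (the notebook does not name the quantity in text), the sketch below shows that the same value can be obtained with `np.mean`. The names `mse_explicit` and `mse_mean` are illustrative.

```python
import numpy as np

y1 = np.array([1, 2, 3])
y2 = np.array([1, 1, 1])

# Sum of squared differences divided by the number of elements...
mse_explicit = np.sum(np.square(y1 - y2)) / y1.size

# ...is the same as the mean of the squared differences.
mse_mean = np.mean((y1 - y2) ** 2)

print(mse_explicit, mse_mean)
```

Both forms rely on element-wise subtraction and squaring, so no explicit loop is needed.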

---

# 3. Pandas

Pandas is a library for manipulating data in tabular form. It is used mainly in data science and provides two kinds of objects:

+ Series
+ DataFrame

![image.png](attachment:image.png)

Once created, these objects come with many associated methods that make it easier to operate on the data.

[Pandas methods](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)
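As a small sketch of both object types, the snippet below builds a Series and a DataFrame by hand and calls a few of their methods. The three players and their goal counts are taken from the euro2020 data used in this notebook; the variable names `goals` and `players` are illustrative.

```python
import pandas as pd

# A Series is a one-dimensional, labeled sequence of values.
goals = pd.Series([5, 5, 4], name="Goals")

# A DataFrame is a two-dimensional table; each column is a Series.
players = pd.DataFrame({
    "Player": ["Cristiano Ronaldo", "Patrik Schick", "Karim Benzema"],
    "Goals": [5, 5, 4],
})

print(goals.sum())                # method on a Series
print(players["Goals"].mean())    # selecting one column returns a Series
print(players.sort_values("Goals").head(1))   # methods can be chained
```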


Pandas can read data from multiple sources, such as:

+ CSV
+ XLSX
+ JSON
+ SQL databases
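The sketch below reads the same two hypothetical rows from CSV text, JSON text, and an in-memory SQLite database, so it runs without any files on disk. The sample data and the table name `players` are assumptions made for illustration. XLSX files are read analogously with `pd.read_excel`, which is omitted here because it needs a real file and an Excel engine installed.

```python
import io
import sqlite3

import pandas as pd

csv_text = "Player,Goals\nCristiano Ronaldo,5\nKarim Benzema,4\n"
json_text = (
    '[{"Player": "Cristiano Ronaldo", "Goals": 5},'
    ' {"Player": "Karim Benzema", "Goals": 4}]'
)

# CSV and JSON: the readers accept file paths or file-like objects.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
df_from_json = pd.read_json(io.StringIO(json_text))

# SQL: write the rows to an in-memory SQLite table, then query them back.
conn = sqlite3.connect(":memory:")
df_from_csv.to_sql("players", conn, index=False)
df_from_sql = pd.read_sql("SELECT * FROM players", conn)
conn.close()

print(df_from_csv)
print(df_from_json)
print(df_from_sql)
```

Whatever the source format, each reader returns a DataFrame, so the rest of the analysis does not depend on where the data came from.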



![image.png](attachment:image.png)

**Pandas Series**

In [33]:
lista_de_edades

[22, 35, 12, 18, 23, 22, 22]

In [34]:
series_de_edades = pd.Series(data = lista_de_edades)
series_de_edades

0    22
1    35
2    12
3    18
4    23
5    22
6    22
dtype: int64

In [35]:
series_de_edades.values

array([22, 35, 12, 18, 23, 22, 22], dtype=int64)

---

**How to read a file with pandas:**

In [2]:
df_euro = pd.read_csv("euro2020.csv") 
df_euro.head()

| | Player | Country | Position | Match played | Goals | Right foot goals | Left foot goals | Header goals | Assists | Total attempts | ... | Top speed | Passing accuracy | Distance covered | Clearence attempted | Balls recovered | Tackels | Blocks | Yellow cards | Red cards | Minutes played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cristiano Ronaldo | Portugal | F | 4 | 5 | 3.0 | 2.0 | 0.0 | 1.0 | 16.0 | ... | 29.7 | 86 | 36.1 | 2.0 | 2.0 | 1.0 | | 0.0 | 0.0 | 360 |
| 1 | Patrik Schick | Czech Republic | F | 5 | 5 | 1.0 | 3.0 | 1.0 | 0.0 | 16.0 | ... | 32.4 | 71 | 43.6 | 4.0 | 6.0 | 3.0 | | 0.0 | 0.0 | 404 |
| 2 | Karim Benzema | France | F | 4 | 4 | 2.0 | 2.0 | 1.0 | 0.0 | 11.0 | ... | 29.5 | 87 | 37.1 | 1.0 | 4.0 | 0.0 | | 0.0 | 0.0 | 349 |
| 3 | Emil Forsberg | Sweden | M | 4 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 14.0 | ... | 29.0 | 79 | 41.4 | | 7.0 | | | 1.0 | 0.0 | 371 |
| 4 | Romelu Lukaku | Belgium | F | 5 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 13.0 | ... | 31.3 | 69 | 40.9 | 2.0 | 3.0 | 0.0 | | 0.0 | 0.0 | 444 |


The values in a pandas DataFrame or a NumPy array also have data types (dtypes).

[Pandas dtypes](https://pbpython.com/pandas_dtypes.html)
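A minimal sketch of inspecting and converting dtypes follows. The small DataFrame is hypothetical toy data: the player names come from the euro2020 data, but the missing `Blocks` value is invented here to show how a missing entry affects the column type.

```python
import numpy as np
import pandas as pd

# NumPy: every array has a single dtype, which can be converted with astype.
arr = np.array([1, 2, 3])
print(arr.dtype)                      # an integer dtype; its width depends on the platform
print(arr.astype(np.float64).dtype)

# pandas: each column has its own dtype.
df = pd.DataFrame({
    "Player": ["Pepe", "Danilo"],
    "Goals": [0, 0],
    "Blocks": [0.0, None],   # hypothetical missing value, stored as NaN
})
print(df.dtypes)
```

A numeric column that contains missing values is stored as a float type by default, which is one reason count-like columns can end up as `float64`.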

![image.png](attachment:image.png)

---

**Some attributes:**

In [48]:
df_euro.dtypes

Player                  object
Country                 object
Position                object
Match played             int64
Goals                    int64
Right foot goals       float64
Left foot goals        float64
Header goals           float64
Assists                float64
Total attempts         float64
On target              float64
Off target             float64
Woodwork               float64
Shot blocks            float64
Avg gpg                float64
Fouls suffered         float64
Fouls committed          int64
Top speed              float64
Passing accuracy         int64
Distance covered       float64
Clearence attempted    float64
Balls recovered        float64
Tackels                float64
Blocks                 float64
Yellow cards           float64
Red cards              float64
Minutes played           int64
dtype: object

In [49]:
df_euro.shape

(59, 27)

In [50]:
df_euro.size

1593

In [40]:
df_euro.columns

Index(['Player', 'Country', 'Position', 'Match played', 'Goals',
       'Right foot goals', 'Left foot goals', 'Header goals', 'Assists',
       'Total attempts', 'On target', 'Off target', 'Woodwork', 'Shot blocks',
       'Avg gpg', 'Fouls suffered', 'Fouls committed', 'Top speed',
       'Passing accuracy', 'Distance covered', 'Clearence attempted',
       'Balls recovered', 'Tackels', 'Blocks', 'Yellow cards', 'Red cards',
       'Minutes played'],
      dtype='object')

In [52]:
df_euro.values

array([['Cristiano Ronaldo', 'Portugal', 'F', ..., 0.0, 0.0, 360],
       ['Patrik Schick', 'Czech Republic', 'F', ..., 0.0, 0.0, 404],
       ['Karim Benzema', 'France', 'F', ..., 0.0, 0.0, 349],
       ...,
       ['Oleksandr Zinchenko', 'Ukraine', 'D', ..., 0.0, 0.0, 480],
       ['Gareth Bale', 'Wales', 'F', ..., 1.0, 0.0, 365],
       ['Aaron Ramsey', 'Wales', 'M', ..., 0.0, 0.0, 354]], dtype=object)

---

**Some methods:**

In [42]:
df_euro.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59 entries, 0 to 58
Data columns (total 27 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Player               59 non-null     object 
 1   Country              59 non-null     object 
 2   Position             58 non-null     object 
 3   Match played         59 non-null     int64  
 4   Goals                59 non-null     int64  
 5   Right foot goals     57 non-null     float64
 6   Left foot goals      57 non-null     float64
 7   Header goals         57 non-null     float64
 8   Assists              58 non-null     float64
 9   Total attempts       58 non-null     float64
 10  On target            56 non-null     float64
 11  Off target           56 non-null     float64
 12  Woodwork             55 non-null     float64
 13  Shot blocks          56 non-null     float64
 14  Avg gpg              59 non-null     float64
 15  Fouls suffered       55 non-null     float

In [58]:
df_euro.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Match played | 59.0 | 4.813559 | 1.166701 | 2.0 | 4.0 | 4.0 | 6.0 | 7.0 |
| Goals | 59.0 | 1.338983 | 1.504134 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| Right foot goals | 57.0 | 0.649123 | 0.876096 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| Left foot goals | 57.0 | 0.45614 | 0.757585 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| Header goals | 57.0 | 0.210526 | 0.490511 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| Assists | 58.0 | 0.741379 | 1.018433 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| Total attempts | 58.0 | 7.62069 | 5.634566 | 0.0 | 2.25 | 7.0 | 12.0 | 20.0 |
| On target | 56.0 | 2.982143 | 2.673038 | 0.0 | 1.0 | 2.5 | 5.0 | 9.0 |
| Off target | 56.0 | 2.982143 | 2.576051 | 0.0 | 1.0 | 2.0 | 4.25 | 9.0 |
| Woodwork | 55.0 | 0.218182 | 0.497807 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |


In [59]:
df_euro['Country'].value_counts()

Italy             7
England           7
Germany           5
Portugal          5
Spain             5
France            5
Denmark           4
Belgium           4
Switzerland       2
Austria           2
Croatia           2
Ukraine           2
Wales             2
Czech Republic    2
Netherlands       2
SwitzerlandF      1
Poland            1
Sweden            1
Name: Country, dtype: int64

In [45]:
df_euro['Position'].value_counts()

M    24
D    17
F    17
Name: Position, dtype: int64

In [46]:
df_euro.describe(include=[object])

| | Player | Country | Position |
|---|---|---|---|
| count | 59 | 59 | 58 |
| unique | 59 | 18 | 3 |
| top | Dani Olmo | Italy | M |
| freq | 1 | 7 | 24 |


---

In [47]:
df_euro.head()

| | Player | Country | Position | Match played | Goals | Right foot goals | Left foot goals | Header goals | Assists | Total attempts | ... | Top speed | Passing accuracy | Distance covered | Clearence attempted | Balls recovered | Tackels | Blocks | Yellow cards | Red cards | Minutes played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cristiano Ronaldo | Portugal | F | 4 | 5 | 3.0 | 2.0 | 0.0 | 1.0 | 16.0 | ... | 29.7 | 86 | 36.1 | 2.0 | 2.0 | 1.0 | | 0.0 | 0.0 | 360 |
| 1 | Patrik Schick | Czech Republic | F | 5 | 5 | 1.0 | 3.0 | 1.0 | 0.0 | 16.0 | ... | 32.4 | 71 | 43.6 | 4.0 | 6.0 | 3.0 | | 0.0 | 0.0 | 404 |
| 2 | Karim Benzema | France | F | 4 | 4 | 2.0 | 2.0 | 1.0 | 0.0 | 11.0 | ... | 29.5 | 87 | 37.1 | 1.0 | 4.0 | 0.0 | | 0.0 | 0.0 | 349 |
| 3 | Emil Forsberg | Sweden | M | 4 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 14.0 | ... | 29.0 | 79 | 41.4 | | 7.0 | | | 1.0 | 0.0 | 371 |
| 4 | Romelu Lukaku | Belgium | F | 5 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 13.0 | ... | 31.3 | 69 | 40.9 | 2.0 | 3.0 | 0.0 | | 0.0 | 0.0 | 444 |


In [61]:
df_euro = df_euro.set_index('Country')
df_euro.head()

| Country | Player | Position | Match played | Goals | Right foot goals | Left foot goals | Header goals | Assists | Total attempts | On target | ... | Top speed | Passing accuracy | Distance covered | Clearence attempted | Balls recovered | Tackels | Blocks | Yellow cards | Red cards | Minutes played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Portugal | Cristiano Ronaldo | F | 4 | 5 | 3.0 | 2.0 | 0.0 | 1.0 | 16.0 | 8.0 | ... | 29.7 | 86 | 36.1 | 2.0 | 2.0 | 1.0 | | 0.0 | 0.0 | 360 |
| Czech Republic | Patrik Schick | F | 5 | 5 | 1.0 | 3.0 | 1.0 | 0.0 | 16.0 | 9.0 | ... | 32.4 | 71 | 43.6 | 4.0 | 6.0 | 3.0 | | 0.0 | 0.0 | 404 |
| France | Karim Benzema | F | 4 | 4 | 2.0 | 2.0 | 1.0 | 0.0 | 11.0 | 5.0 | ... | 29.5 | 87 | 37.1 | 1.0 | 4.0 | 0.0 | | 0.0 | 0.0 | 349 |
| Sweden | Emil Forsberg | M | 4 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 14.0 | 4.0 | ... | 29.0 | 79 | 41.4 | | 7.0 | | | 1.0 | 0.0 | 371 |
| Belgium | Romelu Lukaku | F | 5 | 4 | 2.0 | 2.0 | 0.0 | 0.0 | 13.0 | 6.0 | ... | 31.3 | 69 | 40.9 | 2.0 | 3.0 | 0.0 | | 0.0 | 0.0 | 444 |


In [62]:
df_euro.loc['Portugal']

| Country | Player | Position | Match played | Goals | Right foot goals | Left foot goals | Header goals | Assists | Total attempts | On target | ... | Top speed | Passing accuracy | Distance covered | Clearence attempted | Balls recovered | Tackels | Blocks | Yellow cards | Red cards | Minutes played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Portugal | Cristiano Ronaldo | F | 4 | 5 | 3.0 | 2.0 | 0.0 | 1.0 | 16.0 | 8.0 | ... | 29.7 | 86 | 36.1 | 2.0 | 2.0 | 1.0 | | 0.0 | 0.0 | 360 |
| Portugal | Raphael Guerreiro | D | 4 | 1 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 1.0 | ... | 26.6 | 90 | 39.9 | 2.0 | 18.0 | 4.0 | 0.0 | 0.0 | 0.0 | 360 |
| Portugal | Pepe | D | 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 30.6 | 89 | 37.1 | 8.0 | 28.0 | 4.0 | 0.0 | 1.0 | 0.0 | 360 |
| Portugal | Danilo | M | 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 27.0 | 94 | 26.7 | 5.0 | 14.0 | 2.0 | 0.0 | 0.0 | 0.0 | 237 |
| Portugal | Renato Sanches | M | 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 | ... | 29.0 | 91 | 24.8 | 1.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 220 |


In [63]:
df_euro = df_euro.reset_index()

In [51]:
df_euro.iloc[25:35]

| | Country | Player | Position | Match played | Goals | Right foot goals | Left foot goals | Header goals | Assists | Total attempts | ... | Top speed | Passing accuracy | Distance covered | Clearence attempted | Balls recovered | Tackels | Blocks | Yellow cards | Red cards | Minutes played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | Italy | Marco Veratti | M | 5 | 0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | ... | 27.4 | 93 | 49.9 | 1.0 | 32.0 | 30.0 | | 1.0 | 0.0 | 400 |
| 26 | Spain | Jordi Alba | D | 6 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | ... | 31.5 | 86 | 64.2 | 7.0 | 33.0 | 4.0 | 0.0 | 1.0 | 0.0 | 553 |
| 27 | England | John Stones | D | 7 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 29.7 | 94 | 67.3 | 17.0 | 33.0 | 3.0 | 8.0 | 0.0 | 0.0 | 679 |
| 28 | England | Kalvin Phillips | M | 7 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | ... | 30.2 | 87 | 83.0 | 10.0 | 32.0 | 13.0 | | 1.0 | 0.0 | 665 |
| 29 | England | Harry Maguire | D | 5 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 8.0 | ... | 30.6 | 92 | 51.5 | 29.0 | 34.0 | 6.0 | 4.0 | 3.0 | 0.0 | 510 |
| 30 | England | Mason Mount | M | 5 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | ... | 31.5 | 85 | 50.4 | 1.0 | 11.0 | 2.0 | | 0.0 | 0.0 | 464 |
| 31 | France | Kylian Mbappe | F | 4 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 14.0 | ... | 32.2 | 78 | 35.4 | 0.0 | 4.0 | 0.0 | | 0.0 | 0.0 | 390 |
| 32 | France | Raphael Varane | D | 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 32.6 | 94 | 40.8 | 18.0 | 23.0 | 4.0 | 2.0 | 1.0 | 0.0 | 390 |
| 33 | France | Ngolo Kante | M | 4 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 28.8 | 93 | 46.2 | 2.0 | 25.0 | 10.0 | | 0.0 | 0.0 | 390 |
| 34 | France | Paul Pogba | M | 4 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 8.0 | ... | 29.5 | 86 | 41.8 | 4.0 | 11.0 | 7.0 | | 0.0 | 0.0 | 376 |


---

**Some operations:**

In [68]:
df_euro[df_euro['Goals'] > 3][['Player', "Goals"]]

| | Player | Goals |
|---|---|---|
| 0 | Cristiano Ronaldo | 5 |
| 1 | Patrik Schick | 5 |
| 2 | Karim Benzema | 4 |
| 3 | Emil Forsberg | 4 |
| 4 | Romelu Lukaku | 4 |
| 5 | Harry Kane | 4 |
| 8 | Kasper Dolberg | 4 |
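The filter above combines two steps in one expression. The sketch below separates them on a three-row toy DataFrame (values taken from the euro2020 data, variable names illustrative) and shows an equivalent selection with `.loc`.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Cristiano Ronaldo", "Karim Benzema", "Paul Pogba"],
    "Goals": [5, 4, 1],
})

# Step 1: the comparison yields a boolean Series, one value per row.
mask = df["Goals"] > 3
print(mask)

# Step 2: indexing with the mask keeps only the rows where it is True,
# and the list of names then selects the columns.
top_scorers = df[mask][["Player", "Goals"]]

# The same selection in a single .loc call: rows first, then columns.
top_scorers_loc = df.loc[mask, ["Player", "Goals"]]
print(top_scorers_loc)
```

The filtered result keeps the original row labels, which is why the output above jumps from index 5 to index 8.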


---

Some documentation on descriptive analysis with pandas DataFrames in Python:

[Documentation](https://blog.adrianistan.eu/estadistica-python-media-mediana-varianza-percentiles-parte-iii)

In [53]:
import scipy.stats as ss

In [54]:
# Skewness: a positive value means a longer right tail; a negative value means a longer left tail.
ss.skew(df_euro["Goals"])

0.8189216598168358

In [55]:
# Coefficient of variation
ss.variation(df_euro["Goals"])

1.1137801978460222
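To make the two statistics above less opaque, here is a sketch that computes them by hand with NumPy on a small hypothetical array (the `goals` values are invented for illustration). It assumes the default settings of `scipy.stats.skew` and `scipy.stats.variation`, which use population moments with no bias correction (`ddof=0`).

```python
import numpy as np

goals = np.array([0, 0, 1, 2, 5])   # hypothetical toy data

mean = goals.mean()
deviations = goals - mean

# Skewness: third central moment divided by the second central moment to the power 3/2.
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
skewness = m3 / m2 ** 1.5

# Coefficient of variation: standard deviation relative to the mean.
cv = goals.std() / mean

print(skewness, cv)
```

A coefficient of variation above 1, as in the euro2020 goals, means the standard deviation is larger than the mean, so the values are widely spread relative to their average.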

---

In [56]:
df_cotton_plus = pd.read_excel('Data DS Cotton Plus.xlsx', sheet_name='BD')
df_cotton_plus


| | Cedula | Nombre Completo | Segmento | Compro 2019? | Compro 2020? | Monto Comprado 2019? | Nro Transacciones 2019 | Monto Comprado 2020 | Nro Transacciones 2020 | Tipo de Contacto | ... | TelefonoOficina | Telefono Casa | Monto - MARCA COTTON- B/quilla. | Monto - Brand Store Cucuta | Monto - MARCA COTTON- Bucaramanga | Monto - MARCA COTTON- Manizales | Monto - MARCA COTTON- Pereira | Monto - MARCA COTTON- Cartagena | Monto - MARCA COTTON- Pasto | Monto - MARCA COTTON- Cable Plaza |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Nombre Cliente 1 | Aficionados | Si | No | 264800 | 2 | 0 | 0 | 11001 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | Nombre Cliente 2 | Esporádicos | Si | No | 179900 | 1 | 0 | 0 | 11001 | ... | | | 179900 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Nombre Cliente 3 | Aficionados | Si | No | 189800 | 2 | 0 | 0 | 11101 | ... | | | 451200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | Nombre Cliente 4 | Aficionados | Si | No | 65900 | 1 | 0 | 0 | 11001 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1859600 |
| 4 | 5 | Nombre Cliente 5 | Aficionados | Si | No | 238800 | 2 | 0 | 0 | 11101 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9971 | 9972 | Nombre Cliente 9972 | Aficionados | Si | No | 191260 | 4 | 0 | 0 | 11001 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9972 | 9973 | Nombre Cliente 9973 | Aficionados | Si | No | 371600 | 5 | 0 | 0 | 10100 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9973 | 9974 | Nombre Cliente 9974 | Aficionados | Si | No | 254960 | 3 | 0 | 0 | 11001 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9974 | 9975 | Nombre Cliente 9975 | Esporádicos | Si | No | 159920 | 1 | 0 | 0 | 10001 | ... | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
