# Lab No. 4
- Ricardo Méndez 21289
- Sara Echeverría 21371
- Francisco Castillo 21562
- Melissa Pérez 21385

Repository link: [https://github.com/bl33h/bankCustomerSegmentation](https://github.com/bl33h/bankCustomerSegmentation)

# Task 1

### What are computational graphs, and why are they important for computing gradients in applications such as backpropagation?
A computational graph is a visual and mathematical representation in which each node corresponds to an operation and each edge carries an input or output value. Computational graphs are commonly used to describe algorithms and machine learning models. They matter for gradient computation because they give a clear, structured representation of the operations involved: the connections between nodes trace how data flows from one operation to the next. With that structure, gradients can be computed efficiently by applying the chain rule backward through the graph and reusing intermediate results, which is what makes training models with gradient-based algorithms practical.

[Understanding Gradient Descent and Backpropagation](https://www.shramos.com/2019/02/understanding-gradient-descent-and_3.html)
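
As a minimal sketch of how a computational graph supports backpropagation, the snippet below builds a tiny graph for f(x, y) = (x + y) * y and walks it backward with the chain rule. The `Node` class and the `add`, `mul`, and `backward` helpers are hypothetical names made up for this illustration; they are not taken from the lab or from any library.

```python
class Node:
    """One value in the graph (hypothetical helper for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value
        # pairs of (parent_node, local derivative of this node w.r.t. the parent)
        self.parents = parents
        self.grad = 0.0


def add(a, b):
    # d(a + b)/da = 1 and d(a + b)/db = 1
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))


def mul(a, b):
    # d(a * b)/da = b and d(a * b)/db = a
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))


def backward(output):
    # order the nodes so that every node comes after its parents
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # walk from the output back to the inputs, applying the chain rule
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x = Node(2.0)
y = Node(3.0)
s = add(x, y)   # forward pass: s = x + y
f = mul(s, y)   # forward pass: f = s * y
backward(f)     # fills in x.grad (df/dx) and y.grad (df/dy)
```

Each node stores only its local derivatives, and the backward pass combines them, so `y` accumulates gradient from both paths that use it (df/dy = (x + y) + y).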

### Describe the components and steps that make up a neural network. With this in mind, how would you improve the perceptron you built in the previous lab?

### How is the value of K selected using the silhouette method for the K-Means algorithm? Explain the formulas (equations) involved as well as any assumptions.

### Research Principal Component Analysis (PCA) and answer with respect to algorithms such as K-Means: how could it help improve the quality of your clusters when using K-Means?

# Exploratory analysis

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
plt.style.use('ggplot')

In [2]:
df = pd.read_csv('data/bank_transactions.csv')
df.head()

|   | TransactionID | CustomerID | CustomerDOB | CustGender | CustLocation | CustAccountBalance | TransactionDate | TransactionTime | TransactionAmount (INR) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | T1 | C5841053 | 10/1/94 | F | JAMSHEDPUR | 17819.05 | 2/8/16 | 143207 | 25.0 |
| 1 | T2 | C2142763 | 4/4/57 | M | JHAJJAR | 2270.69 | 2/8/16 | 141858 | 27999.0 |
| 2 | T3 | C4417068 | 26/11/96 | F | MUMBAI | 17874.44 | 2/8/16 | 142712 | 459.0 |
| 3 | T4 | C5342380 | 14/9/73 | F | MUMBAI | 866503.21 | 2/8/16 | 142714 | 2060.0 |
| 4 | T5 | C9031234 | 24/3/88 | F | NAVI MUMBAI | 6714.43 | 2/8/16 | 181156 | 1762.5 |


In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 9 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   TransactionID            1048567 non-null  object 
 1   CustomerID               1048567 non-null  object 
 2   CustomerDOB              1045170 non-null  object 
 3   CustGender               1047467 non-null  object 
 4   CustLocation             1048416 non-null  object 
 5   CustAccountBalance       1046198 non-null  float64
 6   TransactionDate          1048567 non-null  object 
 7   TransactionTime          1048567 non-null  int64  
 8   TransactionAmount (INR)  1048567 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 72.0+ MB
None


In [5]:
print(df.describe())

       CustAccountBalance  TransactionTime  TransactionAmount (INR)
count        1.046198e+06     1.048567e+06             1.048567e+06
mean         1.154035e+05     1.570875e+05             1.574335e+03
std          8.464854e+05     5.126185e+04             6.574743e+03
min          0.000000e+00     0.000000e+00             0.000000e+00
25%          4.721760e+03     1.240300e+05             1.610000e+02
50%          1.679218e+04     1.642260e+05             4.590300e+02
75%          5.765736e+04     2.000100e+05             1.200000e+03
max          1.150355e+08     2.359590e+05             1.560035e+06


In [6]:
# CustomerID is included to be able to merge other important features like average transaction amount, number of transactions
features = df[['CustomerID','CustomerDOB', 'CustGender',
       'CustLocation', 'CustAccountBalance', 'TransactionDate',
       'TransactionTime', 'TransactionAmount (INR)']].copy()

### Encoding and other useful variables

In [7]:
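# encode categorical features and convert dates to their datetime representation
# F -> 1, M -> 0; any other value becomes NaN
features['CustGender'] = features['CustGender'].map({'F': 1, 'M': 0})
# the raw dates are day-first (e.g. 26/11/96), so parse them as such
features['TransactionDate'] = pd.to_datetime(features['TransactionDate'], dayfirst=True)
features['CustomerDOB'] = pd.to_datetime(features['CustomerDOB'], dayfirst=True)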
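
The sample rows shown earlier store dates day-first with two-digit years (for example `26/11/96`). One thing worth checking, sketched below under that assumption, is how pandas expands the century: with `%y`, the values 00 to 68 map to 2000 to 2068, so a birth date such as `4/4/57` lands in 2057. The cutoff date and the 100-year shift are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# three birth dates copied from the df.head() output
raw_dob = pd.Series(['10/1/94', '4/4/57', '26/11/96'])

# explicit day-first format with a two-digit year
dob = pd.to_datetime(raw_dob, format='%d/%m/%y')

# assumption: no customer can be born after the transaction period (2016)
cutoff = pd.Timestamp('2016-12-31')

# keep plausible dates; shift dates after the cutoff back by one century
fixed_dob = dob.where(dob <= cutoff, dob - pd.DateOffset(years=100))

print(pd.DataFrame({'raw': raw_dob, 'parsed': dob, 'fixed': fixed_dob}))
```

Whether a 100-year shift is the right correction depends on the data; it is only a plausible default for birth dates.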

In [9]:
# calculate the monetary transactions for each customer
monetary = df.groupby('CustomerID')['TransactionAmount (INR)'].sum() 
totalTransactions = monetary.reset_index() 
totalTransactions.columns = ['CustomerID', 'Monetary' ]
totalTransactions 

|   | CustomerID | Monetary |
| --- | --- | --- |
| 0 | C1010011 | 5106.0 |
| 1 | C1010012 | 1499.0 |
| 2 | C1010014 | 1455.0 |
| 3 | C1010018 | 30.0 |
| 4 | C1010024 | 5000.0 |
| ... | ... | ... |
| 884260 | C9099836 | 691.0 |
| 884261 | C9099877 | 222.0 |
| 884262 | C9099919 | 126.0 |
| 884263 | C9099941 | 50.0 |
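
The comment in cell `In [6]` also mentions the average transaction amount and the number of transactions per customer. A possible way to obtain all three in one pass, sketched here on made-up toy data, is named aggregation on the grouped column. The column names `AvgAmount` and `Frequency` are hypothetical; only `Monetary` appears in the notebook.

```python
import pandas as pd

# toy transactions (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2'],
    'TransactionAmount (INR)': [100.0, 50.0, 30.0],
})

# one row per customer: total spent, mean amount, number of transactions
summary = (
    toy.groupby('CustomerID')['TransactionAmount (INR)']
       .agg(Monetary='sum', AvgAmount='mean', Frequency='count')
       .reset_index()
)

print(summary)
```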


In [10]:
def timeToSeconds(time):
    # TransactionTime is an integer in HHMMSS form, e.g. 143207 -> 14:32:07
    hours = time // 10000
    minutes = (time % 10000) // 100
    seconds = time % 100
    # seconds elapsed since midnight
    return hours * 3600 + minutes * 60 + seconds

df['TransactionTimeSeconds'] = df['TransactionTime'].apply(timeToSeconds)

# average time of day (in seconds since midnight) of each customer's transactions
averageTransactionTime = df.groupby('CustomerID')['TransactionTimeSeconds'].mean()

print(averageTransactionTime.head())

CustomerID
C1010011    24921.0
C1010012    74649.0
C1010014    68038.0
C1010018    61374.0
C1010024    51063.0
Name: TransactionTimeSeconds, dtype: float64
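
As a quick sanity check of the HHMMSS conversion, the sketch below applies the same arithmetic to whole columns at once instead of row by row. The function name `time_to_seconds` is a hypothetical vectorized variant, and the three inputs are `TransactionTime` values from the `df.head()` output.

```python
import pandas as pd


def time_to_seconds(times: pd.Series) -> pd.Series:
    """Convert integer HHMMSS values to seconds since midnight."""
    hours = times // 10000
    minutes = (times % 10000) // 100
    seconds = times % 100
    return hours * 3600 + minutes * 60 + seconds


sample = pd.Series([143207, 141858, 181156])
converted = time_to_seconds(sample)

# 143207 is 14:32:07, i.e. 14*3600 + 32*60 + 7 seconds
print(converted.tolist())  # [52327, 51538, 65516]
```

Because the arithmetic operators work element-wise on a Series, this avoids calling a Python function once per row, which can matter on a table with over a million transactions.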


In [12]:
# insert the new features into the features dataframe
features = features.merge(averageTransactionTime, on='CustomerID', how='left') # the name in the features is 'TransactionTimeSeconds'
features = features.merge(totalTransactions, on='CustomerID', how='left') # the name in the features is 'Monetary'
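
The left merges above attach one per-customer value to every transaction row. A small sketch on toy data shows the expected behavior; the `validate='many_to_one'` argument is an optional safeguard added here as a suggestion, which makes pandas raise an error if the right-hand table ever contains a duplicated `CustomerID`.

```python
import pandas as pd

# toy transaction rows and per-customer totals (illustrative values)
transactions = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C2'],
    'TransactionAmount (INR)': [100.0, 50.0, 30.0],
})
totals = pd.DataFrame({
    'CustomerID': ['C1', 'C2'],
    'Monetary': [150.0, 30.0],
})

# every transaction keeps its row and gains its customer's total
merged = transactions.merge(
    totals, on='CustomerID', how='left', validate='many_to_one'
)

print(merged)
```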

In [13]:
# make sure the merged columns are stored as numeric types
features['Monetary'] = pd.to_numeric(features['Monetary'])
features['TransactionTimeSeconds'] = pd.to_numeric(features['TransactionTimeSeconds'])

In [14]:
print(features.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1048567 entries, 0 to 1048566
Data columns (total 11 columns):
 #   Column                   Non-Null Count    Dtype         
---  ------                   --------------    -----         
 0   CustomerID               1048567 non-null  object        
 1   CustomerDOB              1045170 non-null  datetime64[ns]
 2   CustGender               1047466 non-null  float64       
 3   CustLocation             1048416 non-null  object        
 4   CustAccountBalance       1046198 non-null  float64       
 5   TransactionDate          1048567 non-null  datetime64[ns]
 6   TransactionTime          1048567 non-null  int64         
 7   TransactionAmount (INR)  1048567 non-null  float64       
 8   CustomerAge              1045170 non-null  float64       
 9   TransactionTimeSeconds   1048567 non-null  float64       
 10  Monetary                 1048567 non-null  float64       
dtypes: datetime64[ns](2), float64(6), int64(1), object(2)
memory us

### Class balancing

In this case the variable of interest, 'CustGender', is imbalanced, so it is appropriate to apply a technique to correct it. Here the majority class (0, male) is randomly undersampled to the size of the minority class (1, female).

In [15]:
features['CustGender'].value_counts()

0.0    765530
1.0    281936
Name: CustGender, dtype: int64

In [16]:
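# undersample the majority class (male) to the size of the minority class (female)
male = features[features['CustGender'] == 0]
female = features[features['CustGender'] == 1]
male = male.sample(n=len(female), random_state=7)
# rows with a missing CustGender are left out of the balanced set
features = pd.concat([male, female])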

In [17]:
features['CustGender'].value_counts()

0.0    281936
1.0    281936
Name: CustGender, dtype: int64
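
The same random undersampling can be written without naming each class, which may help if the label ever has more than two values. This is a sketch on toy data, assuming a pandas version that provides `GroupBy.sample` (1.1 or later); the variable names are illustrative.

```python
import pandas as pd

# toy data: five rows of class 0 and two rows of class 1
toy = pd.DataFrame({
    'CustGender': [0, 0, 0, 0, 0, 1, 1],
    'Monetary': [10, 20, 30, 40, 50, 60, 70],
})

# size of the smallest class
minority_size = int(toy['CustGender'].value_counts().min())

# draw the same number of rows from every class
balanced = toy.groupby('CustGender').sample(n=minority_size, random_state=7)

counts = balanced['CustGender'].value_counts()
print(counts)
```

As with the boolean filters in the notebook, rows whose label is missing are not part of any group and therefore do not appear in the balanced result.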

### Scaling
The features differ greatly in magnitude, so 'TransactionTimeSeconds' needs to be scaled.

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [29]:
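# standardize TransactionTimeSeconds to zero mean and unit variance
features['TransactionTimeSeconds'] = scaler.fit_transform(features['TransactionTimeSeconds'].values.reshape(-1, 1)).ravel()
# sanity check: the mean of the scaled column should be approximately 0
mean_transaction_time_seconds = features['TransactionTimeSeconds'].mean()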
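
For reference, standardization replaces each value x with z = (x - mean) / std. The sketch below reproduces that formula by hand with NumPy on the five per-customer averages printed earlier; `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np

# average transaction times (seconds) from the head() output shown earlier
seconds = np.array([24921.0, 74649.0, 68038.0, 61374.0, 51063.0])

# z-score: subtract the mean, divide by the population standard deviation
z = (seconds - seconds.mean()) / seconds.std()

print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```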


### Splitting

In [30]:
from sklearn.model_selection import train_test_split

# monetary transactions and gender of the customer
feature = features[['Monetary', 'CustGender']]

# avg time of transaction in seconds for each customer
target = features['TransactionTimeSeconds']

X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=7)

# Task 2.1 K-Means

# Task 2.2 Mixture Models