# Topic 8: Association Rules Exercise

PATTERN DISCOVERY USING ASSOCIATION RULES

Using the **IncomeESL** dataset included with the arules library (R), generate
association rules.

To do so, first clean the dataset. In particular:
- Check that there are no missing values.
- Convert the factors to numeric values. ← not necessary!
- Once the dataset is clean, create the transaction matrix using the
transactions function.
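
The missing-value check from the list above can be sketched in pandas as follows. The `toy` DataFrame and its values are hypothetical stand-ins for the real survey data; only `isna`, `sum` and `dropna` are actual pandas calls.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values, not the real file).
toy = pd.DataFrame(
    {
        "sex": ["female", "male", "female", None],
        "type of home": ["house", None, "apartment", "house"],
        "income": ["75+", "[0,10)", "[0,10)", "75+"],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column.
complete = toy.dropna(axis=0)
print(len(toy), "->", len(complete), "rows")
```

Dropping every incomplete row is the simplest option; it can discard a sizeable share of the data when the missing values are spread over many columns.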

When running the algorithm to obtain the rules, remember to set
the parameters of the apriori function and justify the values you choose.

Finally, write a short report summarizing the rules obtained and
analyzing their meaning.


Import dependencies

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

## Step 1: import data

In [2]:
# Load the raw survey data
income_raw = pd.read_csv("./income_raw.csv", sep=",")

In [3]:
income_raw.head()

Unnamed: 0,Unnamed,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home
0,1,75+,female,married,45-54,college (1-3 years),homemaker,>10,no,3,0,own,house,white,
1,2,75+,male,married,45-54,college graduate,homemaker,>10,no,5,2,own,house,white,english
2,3,75+,female,married,25-34,college graduate,professional/managerial,>10,yes,3,1,rent,apartment,white,english
3,4,"[0,10)",female,single,14-17,grades 9-11,student,>10,not married,4,2,live with parents/family,house,white,english
4,5,"[0,10)",female,single,14-17,grades 9-11,student,4-6,not married,4,2,live with parents/family,house,white,english


In [4]:
income_raw.describe(include='all')

Unnamed: 0,Unnamed,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home
count,8993.0,8993,8993,8833,8993,8907,8857,8080,8993,8618.0,8993.0,8753,8636,8925,8634
unique,,9,2,5,7,6,9,5,3,9.0,10.0,3,5,8,3
top,,"[0,10)",female,single,25-34,college (1-3 years),professional/managerial,>10,not married,2.0,0.0,rent,house,white,english
freq,,1745,4918,3654,2249,3066,2820,5182,5438,2664.0,5724.0,3670,5073,5811,7794
mean,4497.0,,,,,,,,,,,,,,
std,2596.199819,,,,,,,,,,,,,,
min,1.0,,,,,,,,,,,,,,
25%,2249.0,,,,,,,,,,,,,,
50%,4497.0,,,,,,,,,,,,,,
75%,6745.0,,,,,,,,,,,,,,


## Step 2: explore and process data

We need to remove the records that are incomplete.

In [5]:
#rename first column
#income_raw.rename(columns={'Unnamed':'id'}, inplace=True)
#income_raw.head()

#remove first column
income_raw.drop(income_raw.columns[0], axis=1, inplace=True)
income_raw.head()

Unnamed: 0,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home
0,75+,female,married,45-54,college (1-3 years),homemaker,>10,no,3,0,own,house,white,
1,75+,male,married,45-54,college graduate,homemaker,>10,no,5,2,own,house,white,english
2,75+,female,married,25-34,college graduate,professional/managerial,>10,yes,3,1,rent,apartment,white,english
3,"[0,10)",female,single,14-17,grades 9-11,student,>10,not married,4,2,live with parents/family,house,white,english
4,"[0,10)",female,single,14-17,grades 9-11,student,4-6,not married,4,2,live with parents/family,house,white,english


In [6]:
# Remove incomplete records (rows with at least one missing value)
income_complete = income_raw.dropna(axis=0)

In [7]:
income_complete.describe(include='all')

Unnamed: 0,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home
count,6876,6876,6876,6876,6876,6876,6876,6876,6876,6876,6876,6876,6876,6876
unique,9,2,5,7,6,9,5,3,9,10,3,5,8,3
top,"[0,10)",female,single,25-34,college (1-3 years),professional/managerial,>10,not married,2,0,rent,house,white,english
freq,1255,3809,2813,1768,2407,2333,4446,4114,2156,4276,2882,4102,4605,6277


In [8]:
income_complete.head()

Unnamed: 0,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home
1,75+,male,married,45-54,college graduate,homemaker,>10,no,5,2,own,house,white,english
2,75+,female,married,25-34,college graduate,professional/managerial,>10,yes,3,1,rent,apartment,white,english
3,"[0,10)",female,single,14-17,grades 9-11,student,>10,not married,4,2,live with parents/family,house,white,english
4,"[0,10)",female,single,14-17,grades 9-11,student,4-6,not married,4,2,live with parents/family,house,white,english
5,"[50,75)",male,married,55-64,college (1-3 years),retired,>10,no,2,0,own,house,white,english


In [9]:
# Total number of transactions and items
#for item in income_complete:
#    print(f"Total items {item}: {income_complete[item].nunique()}")

In [10]:
#income_complete.groupby('id')['income'].apply(list)
#income_complete.groupby('id')['sex'].apply(list)

In [11]:
income_hot_encoded = pd.get_dummies(income_complete, dtype='boolean')  

# With integer (non-bool) one-hot encoding, mlxtend emits this warning:
#
# /home/francd/anaconda3/envs/masterMLpythonConda/lib/python3.11/site-packages/mlxtend/frequent_patterns/fpcommon.py:161: 
# DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance and their support might be discontinued in the future.Please use a DataFrame with bool type

income_hot_encoded.head()

Unnamed: 0,income_75+,"income_[0,10)","income_[10,15)","income_[15,20)","income_[20,25)","income_[25,30)","income_[30,40)","income_[40,50)","income_[50,75)",sex_female,...,ethnic classification_asian,ethnic classification_black,ethnic classification_east indian,ethnic classification_hispanic,ethnic classification_other,ethnic classification_pacific islander,ethnic classification_white,language in home_english,language in home_other,language in home_spanish
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,True,False,False
2,True,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,True,True,False,False
3,False,True,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,True,True,False,False
4,False,True,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,True,True,False,False
5,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,True,True,False,False
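
To make the encoding step concrete, here is a minimal sketch on a toy DataFrame (hypothetical values, not the real data). It shows how `pd.get_dummies` turns each categorical column into one boolean column per category, named `column_value`, which is the layout seen in the output above.

```python
import pandas as pd

# Toy data with two categorical columns (illustrative values only).
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "female"],
        "type of home": ["house", "apartment", "house"],
    }
)

# One boolean column per (column, category) pair.
encoded = pd.get_dummies(toy, dtype=bool)
print(encoded.columns.tolist())

# Each row has exactly one True per original column.
print(encoded.sum(axis=1).tolist())
```

Each encoded column then plays the role of one "item", and each row is one transaction.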


## Step 3: Model training

In [12]:
# Apply the apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(income_hot_encoded, min_support=0.5, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.553956,(sex_female)
1,0.646597,(years in bay area_>10)
2,0.598313,(dual incomes_not married)
3,0.621873,(number of children_0)
4,0.596568,(type of home_house)
5,0.669721,(ethnic classification_white)
6,0.912885,(language in home_english)
7,0.512216,"(language in home_english, sex_female)"
8,0.601367,"(years in bay area_>10, language in home_english)"
9,0.542612,"(dual incomes_not married, language in home_en..."
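
The `support` column above is the fraction of rows in which every item of the itemset is present. A minimal sketch of that computation on a toy one-hot table follows; the data and the helper name `support` are illustrative assumptions, not part of mlxtend.

```python
import pandas as pd

# Toy one-hot table: 4 transactions, 2 items (illustrative values only).
onehot = pd.DataFrame(
    {
        "sex_female": [True, True, False, True],
        "language in home_english": [True, True, True, False],
    }
)


def support(df, items):
    """Fraction of rows in which every column listed in `items` is True."""
    return df[list(items)].all(axis=1).mean()


print(support(onehot, ["sex_female"]))
print(support(onehot, ["sex_female", "language in home_english"]))
```

An itemset is "frequent" when this fraction is at least `min_support`.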


In [13]:
# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(language in home_english),(sex_female),0.912885,0.553956,0.512216,0.561096,1.01289,1.0,0.006518,1.016268,0.146079,0.536563,0.016008,0.742874
1,(sex_female),(language in home_english),0.553956,0.912885,0.512216,0.924652,1.01289,1.0,0.006518,1.156166,0.02853,0.536563,0.135072,0.742874
2,(years in bay area_>10),(language in home_english),0.646597,0.912885,0.601367,0.930049,1.018802,1.0,0.011098,1.245375,0.052221,0.627656,0.197029,0.794402
3,(language in home_english),(years in bay area_>10),0.912885,0.646597,0.601367,0.658754,1.018802,1.0,0.011098,1.035626,0.211848,0.627656,0.034401,0.794402
4,(dual incomes_not married),(language in home_english),0.598313,0.912885,0.542612,0.906903,0.993447,1.0,-0.003579,0.935743,-0.016156,0.56021,-0.06867,0.750648
5,(language in home_english),(dual incomes_not married),0.912885,0.598313,0.542612,0.594392,0.993447,1.0,-0.003579,0.990334,-0.070389,0.56021,-0.009761,0.750648
6,(language in home_english),(number of children_0),0.912885,0.621873,0.580134,0.635495,1.021904,1.0,0.012435,1.03737,0.246049,0.607709,0.036024,0.784188
7,(number of children_0),(language in home_english),0.621873,0.912885,0.580134,0.932881,1.021904,1.0,0.012435,1.297917,0.056686,0.607709,0.229534,0.784188
8,(type of home_house),(language in home_english),0.596568,0.912885,0.544648,0.912969,1.000092,1.0,5e-05,1.000964,0.000228,0.564516,0.000963,0.754796
9,(language in home_english),(type of home_house),0.912885,0.596568,0.544648,0.596623,1.000092,1.0,5e-05,1.000136,0.001055,0.564516,0.000136,0.754796


Same as in R: the rules match those obtained in R for support=0.5 and confidence=0.5.
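
As a sanity check on the metric columns, the following sketch recomputes confidence, lift, leverage and conviction for rule 1, (sex_female) → (language in home_english), from the three support values printed in the table. It uses the standard definitions of these metrics; the variable names are illustrative, and small differences from the table come from the inputs being rounded to six decimals.

```python
# Support values copied from rule 1 in the table above.
support_a = 0.553956   # antecedent: sex_female
support_c = 0.912885   # consequent: language in home_english
support_ac = 0.512216  # antecedent and consequent together

confidence = support_ac / support_a            # P(C | A)
lift = confidence / support_c                  # P(C | A) / P(C)
leverage = support_ac - support_a * support_c  # P(A, C) - P(A) * P(C)
conviction = (1 - support_c) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.4f} "
      f"leverage={leverage:.4f} conviction={conviction:.4f}")
```

A lift this close to 1 means that knowing the respondent is female barely changes the probability that English is spoken at home. The high confidence mostly reflects how common the consequent already is (support 0.91).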

**Before continuing, we do the same as in R: collapse the "income" variable from 9 levels into 3:**

In [14]:
pd.Categorical(income_complete['income'])

['75+', '75+', '[0,10)', '[0,10)', '[50,75)', ..., '[0,10)', '[10,15)', '[0,10)', '[20,25)', '[30,40)']
Length: 6876
Categories (9, object): ['75+', '[0,10)', '[10,15)', '[15,20)', ..., '[25,30)', '[30,40)', '[40,50)', '[50,75)']

In [15]:
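def incomeTo3Levels(level):
    """Collapse one of the 9 original income brackets into 3 broader levels."""
    match level:
        case "[0,10)" | "[10,15)" | "[15,20)":
            return "0-20k$"
        case "[20,25)" | "[25,30)" | "[30,40)" | "[40,50)":
            return "20k-50k$"
        case "[50,75)" | "75+":
            return "50k+$"
        case _:
            # Unknown bracket: fail loudly instead of silently returning None
            raise ValueError(f"Unexpected income level: {level!r}")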
#tests
#print(incomeTo3Levels("[10,15)"))
#print(incomeTo3Levels("[30,40)"))
#print(incomeTo3Levels("75+"))
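
The same grouping can also be written as a lookup table combined with `Series.map`, which avoids the `match` statement (available only from Python 3.10). This is an alternative sketch, not the notebook's method; the names `INCOME_TO_3_LEVELS` and `sample` are hypothetical.

```python
import pandas as pd

# Same grouping as incomeTo3Levels, expressed as a dictionary.
INCOME_TO_3_LEVELS = {
    "[0,10)": "0-20k$",
    "[10,15)": "0-20k$",
    "[15,20)": "0-20k$",
    "[20,25)": "20k-50k$",
    "[25,30)": "20k-50k$",
    "[30,40)": "20k-50k$",
    "[40,50)": "20k-50k$",
    "[50,75)": "50k+$",
    "75+": "50k+$",
}

# Small illustrative sample of income brackets.
sample = pd.Series(["75+", "[0,10)", "[30,40)"], name="income")

# Values missing from the dictionary would become NaN.
income3L_sample = sample.map(INCOME_TO_3_LEVELS).rename("income3L")
print(income3L_sample.tolist())
```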

In [16]:
levels3L_df = income_complete['income'].apply(incomeTo3Levels).rename("income3L")
levels3L_df.head()

1     50k+$
2     50k+$
3    0-20k$
4    0-20k$
5     50k+$
Name: income3L, dtype: object

In [17]:
income_complete = income_complete.join(levels3L_df)
income_complete

Unnamed: 0,income,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home,income3L
1,75+,male,married,45-54,college graduate,homemaker,>10,no,5,2,own,house,white,english,50k+$
2,75+,female,married,25-34,college graduate,professional/managerial,>10,yes,3,1,rent,apartment,white,english,50k+$
3,"[0,10)",female,single,14-17,grades 9-11,student,>10,not married,4,2,live with parents/family,house,white,english,0-20k$
4,"[0,10)",female,single,14-17,grades 9-11,student,4-6,not married,4,2,live with parents/family,house,white,english,0-20k$
5,"[50,75)",male,married,55-64,college (1-3 years),retired,>10,no,2,0,own,house,white,english,50k+$
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8988,"[0,10)",female,single,14-17,grade <9,sales,>10,not married,3,2,live with parents/family,house,white,english,0-20k$
8989,"[10,15)",male,single,18-24,college (1-3 years),professional/managerial,>10,not married,4,0,live with parents/family,house,white,english,0-20k$
8990,"[0,10)",female,single,14-17,grades 9-11,professional/managerial,>10,not married,3,2,live with parents/family,house,white,english,0-20k$
8991,"[20,25)",male,married,55-64,college (1-3 years),laborer,>10,yes,3,1,rent,apartment,white,english,20k-50k$


In [18]:
income3L = income_complete.drop('income',axis=1).copy()
income3L

Unnamed: 0,sex,marital status,age,education,occupation,years in bay area,dual incomes,number in household,number of children,householder status,type of home,ethnic classification,language in home,income3L
1,male,married,45-54,college graduate,homemaker,>10,no,5,2,own,house,white,english,50k+$
2,female,married,25-34,college graduate,professional/managerial,>10,yes,3,1,rent,apartment,white,english,50k+$
3,female,single,14-17,grades 9-11,student,>10,not married,4,2,live with parents/family,house,white,english,0-20k$
4,female,single,14-17,grades 9-11,student,4-6,not married,4,2,live with parents/family,house,white,english,0-20k$
5,male,married,55-64,college (1-3 years),retired,>10,no,2,0,own,house,white,english,50k+$
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8988,female,single,14-17,grade <9,sales,>10,not married,3,2,live with parents/family,house,white,english,0-20k$
8989,male,single,18-24,college (1-3 years),professional/managerial,>10,not married,4,0,live with parents/family,house,white,english,0-20k$
8990,female,single,14-17,grades 9-11,professional/managerial,>10,not married,3,2,live with parents/family,house,white,english,0-20k$
8991,male,married,55-64,college (1-3 years),laborer,>10,yes,3,1,rent,apartment,white,english,20k-50k$


We repeat the previous steps (building the sparse matrix with one-hot encoding, etc.)

In [19]:
income3L_hot_encoded = pd.get_dummies(income3L, dtype='boolean')  

income3L_hot_encoded.head()

#other way
#
#te = TransactionEncoder()
#transactions = te.fit(dataset).transform(dataset, sparse=True)
#sparse_df = pd.DataFrame.sparse.from_spmatrix(transactions, columns=te.columns_)
#sparse_df

Unnamed: 0,sex_female,sex_male,marital status_cohabitation,marital status_divorced,marital status_married,marital status_single,marital status_widowed,age_14-17,age_18-24,age_25-34,...,ethnic classification_hispanic,ethnic classification_other,ethnic classification_pacific islander,ethnic classification_white,language in home_english,language in home_other,language in home_spanish,income3L_0-20k$,income3L_20k-50k$,income3L_50k+$
1,False,True,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,True
2,True,False,False,False,True,False,False,False,False,True,...,False,False,False,True,True,False,False,False,False,True
3,True,False,False,False,False,True,False,True,False,False,...,False,False,False,True,True,False,False,True,False,False
4,True,False,False,False,False,True,False,True,False,False,...,False,False,False,True,True,False,False,True,False,False
5,False,True,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,True
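
The commented-out `TransactionEncoder` alternative in the cell above refers to a `dataset` variable that the notebook never defines. `TransactionEncoder` works on a list of item lists, so one plausible way to build such a list from a DataFrame is sketched below. The `column=value` item naming and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame(
    {
        "sex": ["male", "female"],
        "type of home": ["house", "apartment"],
    }
)

# One transaction per row; each item is a "column=value" string.
dataset = [
    [f"{column}={value}" for column, value in row.items()]
    for _, row in toy.iterrows()
]
print(dataset)
```

A list in this shape could then be passed to `TransactionEncoder().fit(...)`, whereas `pd.get_dummies` reaches an equivalent boolean table directly from the DataFrame.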


Apply Apriori:

In [20]:
# Apply the apriori algorithm to find frequent itemsets
frequent_itemsets_3L = apriori(income3L_hot_encoded, min_support=0.1, use_colnames=True)
#frequent_itemsets_3L
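
To help justify the threshold, it is useful to translate `min_support` into an absolute number of rows. The sketch below does this for the 6,876 complete records and the two thresholds used in this notebook; the helper name `min_rows` is hypothetical.

```python
import math

n_transactions = 6876  # complete records after dropping missing values


def min_rows(min_support, n):
    """Smallest number of rows whose share of n reaches min_support."""
    return math.ceil(min_support * n)


for threshold in (0.5, 0.1):
    print(f"min_support={threshold}: at least "
          f"{min_rows(threshold, n_transactions)} rows")
```

Lowering the threshold from 0.5 to 0.1 admits itemsets seen in far fewer rows, which is why thousands of rules appear instead of ten.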

In [21]:
# Generate association rules
rules3L = association_rules(frequent_itemsets_3L, metric="confidence", min_threshold=0.55)
rules3L

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(marital status_married),(sex_female),0.385689,0.553956,0.231094,0.599170,1.081621,1.0,0.017439,1.112803,0.122840,0.326149,0.101368,0.508170
1,(age_35-44),(sex_female),0.184264,0.553956,0.102676,0.557222,1.005896,1.0,0.000602,1.007376,0.007185,0.161556,0.007322,0.371286
2,(education_college (1-3 years)),(sex_female),0.350058,0.553956,0.195462,0.558371,1.007971,1.0,0.001546,1.009999,0.012167,0.275862,0.009900,0.455610
3,(education_high school graduate),(sex_female),0.215096,0.553956,0.125800,0.584855,1.055779,1.0,0.006646,1.074429,0.067310,0.195569,0.069273,0.405974
4,(years in bay area_>10),(sex_female),0.646597,0.553956,0.370273,0.572650,1.033746,1.0,0.012087,1.043743,0.092371,0.445963,0.041910,0.620533
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5515,"(type of home_apartment, language in home_engl...","(householder status_rent, dual incomes_not mar...",0.197935,0.196335,0.114020,0.576047,2.934000,1.0,0.075158,1.895646,0.821839,0.406850,0.472475,0.578394
5516,"(householder status_rent, type of home_apartme...","(dual incomes_not married, language in home_en...",0.164921,0.368237,0.114020,0.691358,1.877479,1.0,0.053290,2.046911,0.559673,0.272033,0.511459,0.500497
5517,"(type of home_apartment, ethnic classification...","(householder status_rent, dual incomes_not mar...",0.169721,0.242001,0.114020,0.671808,2.776053,1.0,0.072947,2.309620,0.770556,0.383000,0.567028,0.571481
5518,"(dual incomes_not married, type of home_apartm...","(householder status_rent, ethnic classificatio...",0.206079,0.218732,0.114020,0.553282,2.529497,1.0,0.068944,1.748905,0.761618,0.366869,0.428214,0.537279


In [22]:
#look for income in RHS/consequents
#match = {"income3L_0-20k$", "income3L_20k-50k$", "income3L_50k+$"}  
match = {"income3L_20k-50k$"}  

rules3L['3levels'] = ~rules3L['consequents'].apply(match.isdisjoint) 
#rules3L
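
The filter above relies on the `consequents` column holding frozensets: `match.isdisjoint(c)` is `True` when the consequent shares no item with `match`, and the `~` operator inverts that flag across the whole Series. A pure-Python sketch of the same logic, using hypothetical consequents:

```python
match = {"income3L_20k-50k$"}

# Hypothetical consequents, stored as frozensets like in the rules table.
consequents = [
    frozenset({"income3L_20k-50k$"}),
    frozenset({"sex_female"}),
    frozenset({"income3L_20k-50k$", "language in home_english"}),
]

# True when the consequent contains at least one item from `match`.
flags = [not match.isdisjoint(c) for c in consequents]
print(flags)
```

Note that a rule is kept whenever its consequent contains the income item, even if other items appear alongside it.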

In [23]:
rules3L_filtered = rules3L.loc[rules3L['3levels']]
rules3L_filtered

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski,3levels
759,"(householder status_rent, age_25-34)",(income3L_20k-50k$),0.172048,0.403578,0.100931,0.586644,1.453609,1.0,0.031496,1.442879,0.376902,0.212623,0.306941,0.418367,True
861,"(dual incomes_not married, occupation_professi...",(income3L_20k-50k$),0.170448,0.403578,0.101658,0.596416,1.477823,1.0,0.032869,1.477816,0.389763,0.215209,0.323326,0.424154,True
1890,"(householder status_rent, language in home_eng...",(income3L_20k-50k$),0.184264,0.403578,0.101803,0.552486,1.368971,1.0,0.027438,1.332746,0.330406,0.209455,0.24967,0.402369,True



These are the same 3 rules for the middle income bracket as in R!
