In [None]:
#import all of the necessary libraries
!pip install efficient-apriori
import pandas as pd
import numpy as np
from efficient_apriori import apriori




**Colin Hwang and Fakharyar Khan**
**ECE475 - Project 7: Market Basket Analysis**

For our dataset, we chose to perform market basket analysis on a "student performance" dataset. The dataset collects a wide variety of information about students, with the goal of predicting how well a student performs in certain classes. The intended labels were the grades that students received in their first and second semesters, as well as their final, combined grade. The data was collected on students completing secondary education in two different Portuguese schools, where the class subjects were math and Portuguese. The data for each class was collected in a separate dataset; we decided to only analyze the math dataset, where the performance of students varied more. There were a total of 30 attributes, which included information such as the occupation of the student's parents, the relationship of the student with their parents, whether or not the student drinks alcohol during the weekend/weekdays, whether the student is in a romantic relationship, etc. The attributes used to determine a student's performance in a math class were quite diverse and had the potential to create interesting association rules with each other.

A full description of the features is available on the dataset's page: https://archive.ics.uci.edu/ml/datasets/student+performance
________________________________________________________________________________



For the school feature, we found that the dataset only contained data from
two schools, so every student carries one of just two school values. We likely
would have gotten a bunch of rules focused on those two schools simply because
one of them is present in every row, so a market basket analysis wouldn't really
tell us about the relationship between the school a student attends and the
other features in our dataset.

Pstatus indicated whether the student's parents were living together. Nursery told us
if the student went to nursery school, and internet told us if they have
internet access. We thought these would be interesting features
to have, but we found that all of our rules included at least one of these 3 features.
We were really confused about this because it was telling us, with high confidence,
things like: if a student went to nursery school, they'll pursue higher education.

What happened was that these three features were present in almost every datapoint.
Almost every student went to nursery school, had internet, and had parents who were together.
So rules like {internet} -> {nursery} will always pop up with high confidence.
As a result, we had to drop these 3 features too because they couldn't tell us anything about the other features in the dataset.
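
To make the problem with near-universal features concrete, here is a minimal sketch on made-up rows. The `rows` data and the `dominant_share` helper are hypothetical and are not part of our pipeline. It measures how much of a column is taken up by its most common value, and shows that the confidence of {internet} -> {nursery} is roughly the same as the share of students who went to nursery school at all.

```python
from collections import Counter

# Hypothetical toy rows: (internet, nursery, romantic) for 10 students.
rows = [
    ("yes", "yes", "no"), ("yes", "yes", "yes"), ("yes", "yes", "no"),
    ("yes", "yes", "yes"), ("yes", "no", "no"), ("yes", "yes", "yes"),
    ("yes", "yes", "no"), ("no", "yes", "yes"), ("yes", "yes", "no"),
    ("yes", "yes", "yes"),
]
columns = ("internet", "nursery", "romantic")

def dominant_share(rows, index):
    """Fraction of rows taken up by the most common value in one column."""
    counts = Counter(row[index] for row in rows)
    return counts.most_common(1)[0][1] / len(rows)

shares = {name: dominant_share(rows, i) for i, name in enumerate(columns)}

# Confidence of {internet} -> {nursery}: share of nursery among internet users.
with_internet = [r for r in rows if r[0] == "yes"]
confidence = sum(r[1] == "yes" for r in with_internet) / len(with_internet)
# Baseline: how often nursery is "yes" regardless of internet access.
baseline = sum(r[1] == "yes" for r in rows) / len(rows)

print(shares)
print(round(confidence, 3), baseline)
```

In this toy data the rule looks strong on confidence alone, yet knowing that a student has internet tells us nothing extra about nursery school.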

In [None]:
#load in the dataset into a pandas dataframe
df = pd.read_csv("student-mat.csv", sep=';')
#drop out the "problematic" features discussed above

df = df.drop(["school", "Pstatus", "nursery", "internet"], axis = 1)


After dropping those features, we ran into the first hurdle in our project: what do we do with the numerical features in our dataset? We needed a way to represent not just the presence of a feature but also its "intensity". If someone buys 64 apples, we can't just label that as "having bought apples", because the number tells us a lot about the consumer. At the same time, we can't pass these numerical features in as they are, because they would be worthless to the market basket analysis. If "64 apples" is an item, there's a really low chance of that exact item appearing in another person's shopping cart. So we do need to group these numerical features, and the way we do this is going to be different for each feature.
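
The apple example can be sketched in a few lines. The counts below are made up, and the cutoff of 10 between "few" and "many" is an arbitrary assumption for illustration only.

```python
from collections import Counter

# Hypothetical apple counts for 8 shoppers (toy data, not from the dataset).
apples = [1, 2, 3, 5, 8, 40, 64, 70]

def support(items):
    """Map each item to the fraction of transactions that contain it."""
    counts = Counter(items)
    return {item: n / len(items) for item, n in counts.items()}

# Treating each raw count as its own item: every item is unique here.
raw_support = support([f"{n} apples" for n in apples])

# Grouping counts into two bins (the cutoff of 10 is an arbitrary assumption).
binned_support = support(
    ["few apples" if n < 10 else "many apples" for n in apples]
)

print(max(raw_support.values()))  # highest support of any raw item
print(binned_support)
```

Every raw item shows up in a single basket, so none of them could pass a reasonable minimum support, while the binned items can.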

Apparently, in Portugal, students are graded on a scale of 0-20, which is pretty interesting. As a result, the G1, G2, and G3 features go from 0-20 as well. For G1, since it's the first term, the students probably wouldn't have had many tests, so their grades would likely be all over the place. Additionally, since it's the beginning of the semester, we wanted to use a more lenient grading distribution, since the grades they get then aren't very indicative of their understanding of the material.

For G2 and G3, since by then we have a better understanding of the student's proficiency in the subject, we made the grading distribution harsher. To choose the grades, we converted the 1-20 system to an A-F system which might not have been the best option. It's possible that in Portugal, a grade of 15 doesn't actually translate to a 75%. However, without any reference on the grading system, we decided to stick with our conversion.
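
As a rough illustration of the harsher conversion, the sketch below maps a grade on the 20-point scale to a letter using the same cutoffs as our G2/G3 binning (below 12 is F, below 14 is D, below 16 is C, below 18 is B, otherwise A). The `letter_grade` helper is hypothetical and only mirrors those cutoffs.

```python
from bisect import bisect_right

# Cutoffs and labels mirroring the G2/G3 binning used in this notebook.
CUTOFFS = [12, 14, 16, 18]
LETTERS = ["F", "D", "C", "B", "A"]

def letter_grade(score):
    """Return the letter bin for a grade on the 20-point scale."""
    # bisect_right counts how many cutoffs are <= score, which is the bin index.
    return LETTERS[bisect_right(CUTOFFS, score)]

for score in (9, 12, 15, 18, 20):
    # The percentage is the naive score/20 conversion discussed above.
    print(score, letter_grade(score), f"{score / 20:.0%}")
```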

Then for age, we split it between students below 18 and those 18 or older. We did this because, as far as we knew, there weren't any significant differences between people within each of these age groups. However, there is a significant difference between students below 18 and those 18 or older. For one, at 18, in Portugal, it's legal for students to drink alcohol, which is also a feature in our dataset. We didn't look into this, but if Portugal is like the US, then students who are 18 or older can work longer hours, which can definitely impact school performance.

Finally, for absences, we didn't have any intuitive approach to grouping the feature values. So instead we decided to split the feature based on which quantile range a value falls in, using the 30th and 60th percentiles as cutoffs. We didn't use the mean because there seemed to be a lot of outliers in the number of absences (one person was absent 93 times!). We decided to split the feature into three categories:
Rarely Absent, Often Absent, and Frequently Absent. We didn't want to split the feature further because we felt that it would lose any meaningful interpretation, even if it did end up in a lot of rules.
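
Here is a small sketch of why we preferred quantiles over the mean. The absence counts are made up (with one extreme value), and the `quantile` helper is a hand-written stand-in that uses linear interpolation between sorted values, which is the rule pandas applies by default.

```python
from statistics import mean, median

# Hypothetical absence counts with one extreme outlier (toy data).
absences = [0, 0, 2, 2, 4, 4, 6, 8, 10, 93]

def quantile(values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

low_cut = quantile(absences, 0.3)
mid_cut = quantile(absences, 0.6)

def absence_label(n):
    """Assign one of the three absence categories using the two cutoffs."""
    if n < low_cut:
        return "Rarely Absent"
    if n < mid_cut:
        return "Often Absent"
    return "Frequently Absent"

print(mean(absences), median(absences))
print(low_cut, mid_cut)
print([absence_label(n) for n in absences])
```

The single outlier drags the mean far above the typical student, while the quantile cutoffs barely move.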

In [None]:

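#df["G1"] = np.where(df["G1"] < 10, "Fail Math1", "Pass Math1")
#give the top bin as an explicit string default: without one, np.select falls
#back to a numeric default of 0, which can raise a TypeError next to string labels
df["G1"] = np.select([df["G1"] < 10, df["G1"] < 16],
                     ["C: Math1", "B: Math1"], "A: Math1")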

#df["G2"] = np.where(df["G2"] < 10, "Fail Math2", "Pass Math2")

df["G2"] = np.select([df["G2"] < 12, df["G2"] < 14, df["G2"] < 16, df["G2"] < 18],
                     ["F: Math2", "D: Math2", "C: Math2", "B: Math2"], "A: Math2")

#df["G3"] = np.where(df["G3"] < 10, "Fail Math3", "Pass Math3")
df["G3"] = np.select([df["G3"] < 12, df["G3"] < 14, df["G3"] < 16, df["G3"] < 18],
                     ["F: Math3", "D: Math3", "C: Math3", "B: Math3"], "A: Math3")

#two exhaustive cases, so np.where is enough and no default value is needed
df["age"] = np.where(df["age"] <= 17, "Below 18", "Above 17")

lquantile = df["absences"].quantile(0.3)
mquantile = df["absences"].quantile(0.6)

df["absences"] = np.select([df["absences"] < lquantile, df["absences"] < mquantile],
                           ["Rarely Absent", "Often Absent" ], 
                           "Frequently Absent")

df["failures"] = np.select([df["failures"] < 2], ["Failed Few or No Classes"], "Failed a lot of Classes")


Here we converted every feature to a string and gave the binary yes/no features labels that better indicate what they are.

In [None]:
#convert all of the features into strings
for i in df.columns:
  df[i] = df[i].astype(str)

df["schoolsup"] = np.where(df["schoolsup"] == "no", "no school support", "school support")
df["famsup"] = np.where(df["famsup"] == "no", "no family educational support", "family educational support")
df["paid"] = np.where(df["paid"] == "no", "no Tutoring", "Tutoring")
df["activities"] = np.where(df["activities"] == "no", "no extracurriculars", "Extracurriculars")
df["higher"] = np.where(df["higher"] == "no", "no higher education", "higher education")
df["romantic"] = np.where(df["romantic"] == "no", "no romantic", "romantic")

A good chunk of the features in the dataset had values that were ratings from 1-5 (which makes sense since the dataset came from a survey). For all but one of these features, we found that we could represent them very well using just two bins that indicate the presence or absence of the feature. So for example, you either study a lot or don't study, have a lot of free time or don't, etc. Since these are survey questions, it would be hard to make any further inferences from the intensities of the features, since how people rate things pertaining to themselves varies wildly from person to person. For this reason, we also didn't use averages, since they wouldn't be a good representation of the center for this type of data. Instead, the median rating, 3, seemed like a much better value for the center.

The last choice we made regarded making the absence of a feature an item. We initially didn't do this because it didn't really make sense to us to do that. If a user doesn't buy a pencil, I wouldn't say that the user bought a "notpencil". However, this isn't actually the case here. Here, the user chose to "buy" the No Free Time package and that's something we want to know when we conduct the market basket analysis. 
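
The difference between the two encodings can be sketched as follows. The ratings are made up, and the `one_sided` and `two_sided` helpers are hypothetical names used only for this illustration.

```python
# Hypothetical survey ratings (1-5) for free time, one per student.
freetime_ratings = [1, 2, 2, 4, 5, 3]

def one_sided(rating):
    """Only emit an item when the feature is 'present' (rating >= 3)."""
    return ("A lot of free time",) if rating >= 3 else ()

def two_sided(rating):
    """Always emit an item, so a low rating is visible to the miner too."""
    return ("A lot of free time",) if rating >= 3 else ("No Free Time",)

one_sided_transactions = [one_sided(r) for r in freetime_ratings]
two_sided_transactions = [two_sided(r) for r in freetime_ratings]

def support(transactions, item):
    """Fraction of transactions that contain the given item."""
    return sum(item in t for t in transactions) / len(transactions)

print(support(one_sided_transactions, "No Free Time"))
print(support(two_sided_transactions, "No Free Time"))
```

With the one-sided encoding, "No Free Time" never appears in any basket, so no rule could ever mention it.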

In [None]:

#splits a rating column of df at 3 and labels each side.
#Takes in the dataframe column, the label for ratings below 3 (low)
#and the label for ratings of 3 or more (high)
def rate_cat(col, low, high):
  #convert the elements of the dataframe column into ints
  col = col.astype(int)
  #np.where always returns one of the two labels, so no default value is needed
  return np.where(col < 3, low, high)

#if < 3, the feature is absent, else the feature is present

df["famrel"] = rate_cat(df["famrel"],
                        "Bad Family Relationships", "Good Family Relationships")

df["freetime"] = rate_cat(df["freetime"], "No Free Time", "A lot of free time")

df["traveltime"] = rate_cat(df["traveltime"], "Low Commute Time", "High commute time")

df["studytime"] = rate_cat(df["studytime"], "Doesn't Study", "Studies Alot")


df["goout"] = rate_cat(df["goout"], "Doesn't Go Out", "Goes out Often")

df["Dalc"] = rate_cat(df["Dalc"], "Drinks Very Little on Weekdays", "Drinks on Weekdays")

df["Walc"] = rate_cat(df["Walc"], "Drinks Very Little On Weekends", "Drinks on Weekends")

df["health"] = rate_cat(df["health"], "Unhealthy", "Healthy")

df["Fedu"] = rate_cat(df["Fedu"], "Father No Higher Education", "Father Higher Education")

df["Medu"] = rate_cat(df["Medu"], "Mother No Higher Education", "Mother Higher Education")

With this dataset, we were able to get a lot of rules at a pretty high minimum support and confidence. The rules are generally what you might expect. For example, if a student drinks very little on weekdays, they're likely to have failed few or no classes, which makes sense. Interestingly, students who didn't seek out school support were associated with having failed few or no classes and with pursuing higher education. One possible explanation is that students who don't seek out school support might be good at self-studying or doing research to answer their own questions. Unfortunately, none of the rules include the grades that the students got in the math class, which is pretty disappointing. We tried converting the grades to just pass/fail but still weren't getting anything, even after lowering the support and confidence. We also didn't see any rules for the romantic feature (whether the student is in a romantic relationship), which was also disappointing for us.

In [None]:
#convert the df into a format that can be used by the library
#that implements the apriori algorithm
transactions = list(df.itertuples(index=False, name=None))

itemsets, rules = apriori(transactions, min_support=0.8, min_confidence=0.93)

for rule in rules:
  print(rule)

{Drinks Very Little on Weekdays} -> {Good Family Relationships} (conf: 0.940, supp: 0.835, lift: 1.006, conv: 1.100)
{Drinks Very Little on Weekdays} -> {Low Commute Time} (conf: 0.937, supp: 0.833, lift: 1.017, conv: 1.252)
{Drinks Very Little on Weekdays} -> {higher education} (conf: 0.949, supp: 0.843, lift: 0.999, conv: 0.987)
{Failed Few or No Classes} -> {Good Family Relationships} (conf: 0.939, supp: 0.861, lift: 1.005, conv: 1.083)
{higher education} -> {Failed Few or No Classes} (conf: 0.931, supp: 0.884, lift: 1.016, conv: 1.205)
{Failed Few or No Classes} -> {higher education} (conf: 0.964, supp: 0.884, lift: 1.016, conv: 1.410)
{Low Commute Time} -> {Good Family Relationships} (conf: 0.934, supp: 0.861, lift: 1.000, conv: 0.998)
{higher education} -> {Good Family Relationships} (conf: 0.933, supp: 0.886, lift: 0.999, conv: 0.987)
{Good Family Relationships} -> {higher education} (conf: 0.949, supp: 0.886, lift: 0.999, conv: 0.983)
{no school support} -> {Good Family Relatio

Further analyzing the results we obtained, many of the rules consist of different combinations of the same attributes. This is to be expected: if the rule from wanting to pursue higher education to failing few/no classes and having a low commute time has high support and confidence, then the rule from having a low commute time to failing few/no classes and wanting to pursue higher education will likely have high support and confidence too, since both are built from the same frequent itemset. However, in addition to this, the same 6-7 attributes appear in every rule, and the other 23-24 attributes do not make an appearance at all. Furthermore, what these rules are telling us goes along the lines of: avoiding drinking on weekdays, having good family relationships, and wanting to pursue higher education go together with having failed few or no classes in the past. This is nothing we didn't already know, and it isn't quite as interesting as we had hoped. We suspect this is because almost all of the students possessed these 6-7 attributes, but not necessarily the others, which gives those attributes high support values. The lift values in the output above, which all sit close to 1, are consistent with this.
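
To see how a rule can have high support and confidence while saying very little, the sketch below computes the four printed metrics from their standard definitions on a made-up set of 20 baskets. The `rule_metrics` helper and the toy baskets are hypothetical; they are not how the library computes its output internally.

```python
def rule_metrics(transactions, lhs, rhs):
    """Support, confidence, lift and conviction for the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(lhs <= set(t) for t in transactions)
    n_rhs = sum(rhs <= set(t) for t in transactions)
    n_both = sum((lhs | rhs) <= set(t) for t in transactions)
    support = n_both / n
    confidence = n_both / n_lhs
    rhs_support = n_rhs / n
    # Lift compares the confidence with how common the consequent is anyway.
    lift = confidence / rhs_support
    # Conviction grows without bound as the confidence approaches 1.
    conviction = (
        float("inf") if confidence == 1 else (1 - rhs_support) / (1 - confidence)
    )
    return support, confidence, lift, conviction

# Hypothetical baskets: two items that are each present in almost every basket.
toy = (
    [("higher education", "Good Family Relationships")] * 17
    + [("higher education",)] * 2
    + [("Good Family Relationships",)] * 1
)

support, confidence, lift, conviction = rule_metrics(
    toy, ["higher education"], ["Good Family Relationships"]
)
print(support, round(confidence, 3), round(lift, 3), round(conviction, 3))
```

The same arithmetic can be checked against the first printed rule: a confidence of 0.940 with a lift of 1.006 implies that the consequent appears in about 0.934 of all rows, and (1 - 0.934) / (1 - 0.940) is about 1.1, which agrees with the reported conviction of 1.100 up to rounding.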

Therefore, we decided to try lowering the support to 0.3 while raising the minimum confidence to 0.98. These are the results we obtained:

In [None]:
transactions = list(df.itertuples(index=False, name=None))

itemsets, rules = apriori(transactions, min_support=0.3, min_confidence=0.98)

#skip rules that have a "drinks very little" item on both sides
low_alcohol = {"Drinks Very Little On Weekends", "Drinks Very Little on Weekdays"}

for rule in rules:
  if low_alcohol & set(rule.lhs) and low_alcohol & set(rule.rhs):
    continue
  print(rule)

{F} -> {higher education} (conf: 0.981, supp: 0.516, lift: 1.033, conv: 2.633)
{Father Higher Education} -> {higher education} (conf: 0.985, supp: 0.489, lift: 1.037, conv: 3.308)
{Tutoring} -> {higher education} (conf: 0.994, supp: 0.456, lift: 1.048, conv: 9.165)
{A lot of free time, F} -> {higher education} (conf: 0.981, supp: 0.385, lift: 1.033, conv: 2.616)
{A lot of free time, Father Higher Education} -> {higher education} (conf: 0.981, supp: 0.385, lift: 1.033, conv: 2.616)
{A lot of free time, Tutoring} -> {higher education} (conf: 0.993, supp: 0.347, lift: 1.046, conv: 6.987)
{Below 18, F} -> {higher education} (conf: 1.000, supp: 0.380, lift: 1.053, conv: 50632911.392)
{Below 18, Failed Few or No Classes} -> {higher education} (conf: 0.985, supp: 0.658, lift: 1.037, conv: 3.342)
{Below 18, Father Higher Education} -> {higher education} (conf: 0.986, supp: 0.367, lift: 1.039, conv: 3.722)
{Below 18, Frequently Absent} -> {higher education} (conf: 0.986, supp: 0.349, lift: 1.03
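
Since almost every rule above points to {higher education}, one possible follow-up is to rank rules by lift instead of confidence. The sketch below is only an illustration: `ToyRule` is a hypothetical stand-in for the rule objects (it assumes they expose a `lift` attribute alongside the `lhs` and `rhs` used above), and the 1.04 cutoff is an arbitrary choice.

```python
from collections import namedtuple

# Hypothetical stand-in for the printed rule objects; it only mimics the
# lhs, rhs, confidence and lift attributes.
ToyRule = namedtuple("ToyRule", ["lhs", "rhs", "confidence", "lift"])

def interesting(rules, min_lift=1.04):
    """Keep rules whose lift is at least min_lift, strongest first.

    The 1.04 default is an arbitrary cutoff chosen for this illustration.
    """
    kept = [r for r in rules if r.lift >= min_lift]
    return sorted(kept, key=lambda r: r.lift, reverse=True)

# Values copied from three of the rules printed above.
toy_rules = [
    ToyRule(("Father Higher Education",), ("higher education",), 0.985, 1.037),
    ToyRule(("Tutoring",), ("higher education",), 0.994, 1.048),
    ToyRule(("Below 18", "F"), ("higher education",), 1.000, 1.053),
]

for rule in interesting(toy_rules):
    print(rule.lhs, "->", rule.rhs, rule.lift)
```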

**Examples of Rules Found**
________________________________________________________________________

- {Drinks Very Little on Weekdays} -> {Failed Few or No Classes} 
- Confidence: 0.963, Support: 0.856
________________________________________________________________________

- {Drinks Very Little on Weekdays} -> {Good Family Relationships} 
- Confidence: 0.940, Support: 0.835
________________________________________________________________________
- {Good Family Relationships, Low Commute Time, higher education} -> {Failed Few or No Classes}
- Confidence: 0.981, Support: 0.805
________________________________________________________________________
- {Low Commute Time, higher education} -> {Failed Few or No Classes}
- Confidence: 0.980, Support: 0.861
________________________________________________________________________
- {higher education, no school support} -> {Failed Few or No Classes}
- Confidence: 0.972, Support: 0.800
________________________________________________________________________
- {C: Math1, F: Math2} -> {F: Math3} 
- Confidence: 0.981, Support: 0.516
________________________________________________________________________
- {Father Higher Education, Mother Higher Education} -> {higher education}
- Confidence: 0.994, Support: 0.418
________________________________________________________________________
- {Drinks Very Little on Weekdays, Failed Few or No Classes, Good Family Relationships, Low Commute Time, no school support} -> {higher education}
- Confidence: 0.960, Support: 0.610
________________________________________________________________________