### Data Mining Spring 2023 Project:
#### Understanding and predicting Shark Presence in Near Shore Waters
#### Group Members:<br><br>
<p> This project has two deliverables:<br>
Deliverable 1:  Domain Understanding, Data Exploration and Preparation, Decision Trees and Random Forests (due 4/20)<br>
Deliverable 2:  Association Rules (due 5/2 2:30 pm end of exam period)<br>


#### Deliverable 2
This is Deliverable 2. You will be asked to share your results of Deliverable 1 and 2 (screenshots and file) in a discussion forum. In this notebook there will be code, as well as places for you to watch a video, read an article or answer questions. </p>

#### 1. A.  Import Libraries
<p>We are importing pandas and numpy for working with data and we are also adding another library called Orange which will handle the creation of the association rules. </p><p>You can simply run this code</p>

In [None]:
#some code so those pesky warnings from deprecated code won't appear
#the filter is set globally so it stays active for the rest of the notebook
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)
#the rest of the imports
#pandas for working with datasets
import pandas as pd
#numpy for working with arrays
import numpy as np
#import Orange for the association rules - since we've never used it we need to install it first
import sys
!{sys.executable} -m pip install orange3
!{sys.executable} -m pip install orange3-associate
import Orange
from orangecontrib.associate.fpgrowth import *  
from Orange.data.pandas_compat import table_from_frame

#### 1. B.  Read in the Dataset
<p>Now that we have all of our libraries, we read in the necessary dataset. We will be using the same shark attack dataset as the last notebook.</p><p>This time, we will be focusing on the categorical (non-numeric) attributes.</p>

In [None]:
# just as in the previous notebook, we need to read in the dataset
# we need to specify the encoding
# remember that the csv file is posted on the class github site
bdf = pd.read_csv('https://raw.githubusercontent.com/catawba-data-mining/CIS-3902-Data-Mining/main/sharkdata.csv', encoding="ISO-8859-1")
#let's take a look at the attributes and file size
bdf.info()

#### 1. C.  Transform the Attributes
<p>This is the same code as was used in the previous notebook. We still need to transform the data into categories.</p>

In [None]:
#change object type attributes - most of the discretized features - to categorical
#object type can be difficult to visualize and model
bdf["turtleexactdiscretizeSC"] = bdf["turtleexactdiscretizeSC"].astype('category')
bdf["TurtleexactdiscretizeNC"] = bdf["TurtleexactdiscretizeNC"].astype('category')
bdf["TurtleAttackActivityDiscretized"] = bdf["TurtleAttackActivityDiscretized"].astype('category')
bdf["Area"] = bdf["Area"].astype('category')
bdf["Attack"] = bdf["Attack"].astype('category')
bdf["Timeofattack"] = bdf["Timeofattack"].astype('category')
bdf["Beach"] = bdf["Beach"].astype('category')
bdf["DissolvedO2discretize"] = bdf["DissolvedO2discretize"].astype('category')
bdf["salinitydiscretize"] = bdf["salinitydiscretize"].astype('category')
bdf["turbiditydiscretize"] = bdf["turbiditydiscretize"].astype('category')
bdf["temperaturediscretize"] = bdf["temperaturediscretize"].astype('category')
bdf["precipitationdiscretize"] = bdf["precipitationdiscretize"].astype('category')
bdf["pressurediscretize"] = bdf["pressurediscretize"].astype('category')
bdf["windspeeddiscretize"] = bdf["windspeeddiscretize"].astype('category')
bdf["precipitationmvadiscretize"] = bdf["precipitationmvadiscretize"].astype('category')
bdf["CrabLandingsDisc"] = bdf["CrabLandingsDisc"].astype('category')
bdf["Direction"] = bdf["Direction"].astype('category')
bdf["DirectionDisc"] = bdf["DirectionDisc"].astype('category')
bdf["DirectionDiscInt"] = bdf["DirectionDiscInt"].astype('category')
bdf["MoonPhaseCat"] = bdf["MoonPhase"].astype('category')
bdf["MoonPhaseCatExtend"] = bdf["MoonPhaseIntExtend"].astype('category')
#change attack and moonphase cat to codes to help with scatter matrix visualization
#MoonPhaseCat is the actual MoonPhase as a string
#MoonPhaseCatExtended is the Extended MoonPhase
#0 is Quarter moons, 1 is wan gibb and wax cres, 2 is wax gibb and wan cres, 3 is Full and New
#DirectionDiscInt is the Wind Direction discretized
#NE = 1, E = 2, SE = 3, S = 4, W = 5, SW = 6
bdf["AttackCat"] = bdf["Attack"].cat.codes
bdf["MoonPhaseCatExtendCodes"] = bdf["MoonPhaseCatExtend"].cat.codes
bdf["DirectionDiscIntCodes"] = bdf["DirectionDiscInt"].cat.codes
#fix date time
bdf["Date"] = bdf["Date"].astype('category')
format_str = '%d/%m/%Y' # The format
bdf["Date"] = bdf["Date"].apply(pd.to_datetime)
#datetime.datetime.strptime(bdf["Date"], format_str)
#print info again on data frame and attributes
bdf.info()

In [None]:
#let's take a quick peek at the dataset to make sure everything looks good
bdf.head()

#### 1. D.  Select the relevant columns
<p>Now, we select the columns that we want to use for modeling. We store this set of columns in a dataframe named ar, short for association rules.</p><p>As before, we won't use two correlated attributes together for modeling.</p>

In [None]:
# now, we pick some of the attributes we are interested in for generating rules
ar = bdf[["Attack", "DissolvedO2discretize", "salinitydiscretize",
          "turbiditydiscretize", "pressurediscretize", "windspeeddiscretize", 
          "DirectionDisc", "CrabLandingsDisc", "MoonPhaseCatExtend"]]
#take a look
ar.info()

In [None]:
# and let's look at it as a dataframe
ar.head()

<font color=RED>QUESTION: </font> <p>We are only using discrete features for this portion. Could any of the other attributes be transformed to be used?  How would you do that?</p>

#### 2. A.  Encode the table

<p>We are looking for groups of attributes that occur frequently together. However, we need to do a bit of preprocessing first. </p>

<p>We need to use something called OneHot encoding to prepare the dataset. OneHot transforms the data so that every possible value is its own column. Then, it labels each column as 0 or 1 depending on whether that value is true or false for the row. The number of columns grows very fast, but the data is very simple.</p>
    
<p>Read this article on <a href="https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179">OneHot Encoding</a>.</p>
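
The idea can be sketched on a toy table before running it on the shark data. This is a minimal pure-Python illustration, not what Orange does internally; the attribute names and values (`salinity`, `wind`, `high`, `low`) and the helper `one_hot` are made up for the example.

```python
# Toy rows with two categorical attributes (values invented for illustration).
rows = [
    {"salinity": "high", "wind": "low"},
    {"salinity": "low", "wind": "low"},
    {"salinity": "high", "wind": "high"},
]

def one_hot(rows):
    """Return (column names, 0/1 table) with one column per attribute=value pair."""
    # Every attribute=value pair seen anywhere in the data becomes its own column.
    columns = sorted({f"{attr}={val}" for row in rows for attr, val in row.items()})
    table = []
    for row in rows:
        present = {f"{attr}={val}" for attr, val in row.items()}
        # 1 if the row has that value, 0 otherwise.
        table.append([1 if col in present else 0 for col in columns])
    return columns, table

columns, encoded = one_hot(rows)
print(columns)
for encoded_row in encoded:
    print(encoded_row)
```

Two attributes with two values each already produce four columns, which is why the column count grows quickly on real data.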

In [None]:
# In this code, X is the new table being created and mapping stores the list of changes so we can understand our output. 
X, mapping = OneHot.encode(table_from_frame(ar))

In [None]:
# This is the transformed table.
X

#### 2. B.  Determine Frequent Itemsets

Frequent itemsets are groups of values that appear together often. When discussing frequent itemsets, we often use the term support. Support is a measure of how often an itemset appears in the dataset. We usually aren't interested in a group that shows up only once, so we set a minimum support, given either as a specific count or as a percentage of the dataset.

<a href="https://youtu.be/TcUlzuQ27iQ">Video on the algorithm being used to make the item sets. It goes on to form representative rules which are a special variation of association rules.</a><br>
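
To make support concrete, here is a small hand-rolled sketch that computes it as a fraction of records. The records and the helper `support` are hypothetical and only illustrate the definition; the Orange call in the next cell does the real work on the full dataset.

```python
# Each record is the set of attribute=value items it contains (toy data).
transactions = [
    {"salinity=high", "wind=low", "Attack=No"},
    {"salinity=high", "wind=low", "Attack=Yes"},
    {"salinity=low", "wind=low", "Attack=No"},
    {"salinity=high", "wind=high", "Attack=No"},
]

def support(itemset, transactions):
    """Fraction of records that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for record in transactions if itemset <= record)
    return hits / len(transactions)

pair_support = support({"salinity=high", "wind=low"}, transactions)
single_support = support({"Attack=Yes"}, transactions)
print(pair_support, single_support)
```

With a minimum support of 0.5, the pair above would count as frequent, while `Attack=Yes` on its own would not.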

In [None]:
# Now, we can call frequent itemsets from the Orange library.
# We require a minimum support of 0.1.
# To be considered frequent, an itemset must appear in at least 10% of our data.
frequent_sets = dict(frequent_itemsets(X, .1))

In [None]:
# Here is what the itemsets look like. They are still encoded. 
frequent_sets

In [None]:
# Each key corresponds to a specific value. The code below decodes the table and shows what each key means. 
key_names = {item: '{}={}'.format(var.name, val)
  for item, var, val in OneHot.decode(mapping, table_from_frame(ar), mapping)}
key_names

In [None]:
# We store the key values for the attack category for later. We are interested in 1, which represents Attack=Yes. 
attack_keys = {1}
attack_keys
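
Hard-coding the key works only as long as the encoding assigns 1 to Attack=Yes. A minimal sketch of looking the key up by name instead is shown here on a small hypothetical mapping shaped like key_names; the salinity label is invented for illustration, and the real labels should be checked against the key_names output above.

```python
# Hypothetical decoded mapping, shaped like key_names in this notebook.
example_key_names = {
    0: "Attack=No",
    1: "Attack=Yes",
    2: "salinitydiscretize=high",  # invented label, for illustration only
}

# Collect every key whose decoded name is the value we care about.
derived_attack_keys = {
    key for key, name in example_key_names.items() if name == "Attack=Yes"
}
print(derived_attack_keys)
```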

If we compare the list of key_names to the itemsets, we can see which items appear together most often. Some of the sets contain only one item. 

However, this is awkward to read and would become time-consuming to work with, so we will decode the sets and transform them back into words. 

In [None]:
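# itemset is used as the loop variable so that Python's built-in set is not overwritten
for itemset in frequent_sets:
    print(', '.join(key_names[i] for i in itemset), "- support: ", frequent_sets[itemset])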

#### 3. A.  Generate Association Rules

So, now we have some frequent itemsets. We can use these to make association rules. Rules go one step further: they state that when certain items appear together, another specific item usually appears as well. 

When we generate the rules, we specify a minimum confidence. Confidence is a measure of how often the rule proves to be true. It is calculated by dividing the number of times the rule is true by the total number of times the conditions for the rule appear. In this case, we will set the minimum confidence to 50%.

These rules have two parts, an antecedent and a consequent. The antecedent is the set of conditions and the consequent is the result. So, if you have a rule that says high salinity implies no attack, then the salinity is the antecedent and the attack is the consequent. 

The last part of the code restricts the rules to ones that have a single-item consequent and where that consequent is Attack=Yes. One of the downsides of association rules is that many uninteresting rules are generated. So, we usually restrict the rules we examine in some way. 
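
As a worked illustration of the confidence calculation, the sketch below scores the rule "high salinity implies no attack" on a handful of made-up records. The data and the helper `confidence` are hypothetical; they are not results from the shark dataset.

```python
# Toy records: four with high salinity (three without an attack), one with low salinity.
transactions = [
    {"salinity=high", "Attack=No"},
    {"salinity=high", "Attack=No"},
    {"salinity=high", "Attack=No"},
    {"salinity=high", "Attack=Yes"},
    {"salinity=low", "Attack=Yes"},
]

def confidence(antecedent, consequent, transactions):
    """Records where the whole rule holds, divided by records where the antecedent holds."""
    antecedent, consequent = set(antecedent), set(consequent)
    antecedent_count = sum(1 for record in transactions if antecedent <= record)
    rule_count = sum(1 for record in transactions if (antecedent | consequent) <= record)
    # A rule whose antecedent never appears has no meaningful confidence; report 0.0.
    return rule_count / antecedent_count if antecedent_count else 0.0

rule_confidence = confidence({"salinity=high"}, {"Attack=No"}, transactions)
print(rule_confidence)
```

The antecedent appears in four records and the full rule holds in three of them, so this rule would pass a 50% minimum confidence.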

In [None]:
# We can use the library again to generate the rules. 
# We generate them using the package and then place them in a list. 
# P and Q represent the two parts of the rule, the antecedent and the consequent
rules = [(P, Q, supp, conf)
 for P, Q, supp, conf in association_rules(frequent_sets, .5)
 if len(Q) == 1 and Q & attack_keys]

In [None]:
# We print the first 20 rules
rules[:20]

In [None]:
# Finally, we use key_names to transform the rules into something easily readable and print the first 20. 
for ante, cons, supp, conf in rules[:20]:
    print(', '.join(key_names[i] for i in ante), '-->',
          key_names[next(iter(cons))],
          '(supp: {}, conf: {})'.format(supp, conf))

<font color=RED>QUESTION: </font> <p>After generating all of this do you notice any interesting patterns in the rules? List some ways you could modify the above code to learn more.</p>