# Best day of the week for a cleanup

When a person or group wants to do a beach cleanup, they have to decide on which day of the week to do it.
For example, they could choose Monday so people have the weekend free, or perhaps they would do Sunday because fewer people work on that day.
This notebook recommends which day of the week to choose using previous cleanups' data.

First, let's load in the cleanups' data.
Note that 0 = Monday, 6 = Sunday.

In [94]:
import pandas as pd
import sqlite3

old = pd.read_csv('../data/cleanups.csv')[['DateOriginal', 'NAME', 'COUNTRY']].rename(columns={'COUNTRY': 'Country', 'NAME': 'State'})
old['Day of week'] = pd.to_datetime(old['DateOriginal']).dt.dayofweek
del old['DateOriginal']
new = pd.read_csv('../data/new-cleanups.csv')
new = pd.DataFrame({'Day of week': pd.to_datetime(new['Cleanup Date']).dt.dayofweek, 
                    'State': new['State'].astype(str).apply(lambda state: state.split(',')[0] if state else None), 
                    'Country': new['Country']})
cleanups = pd.concat([new, old]).reset_index().drop(columns=['index'])
cleanups

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,Day of week,State,Country
0,5.0,California,United States
1,5.0,Wisconsin,United States
2,1.0,Queensland,Australia
3,1.0,Queensland,Australia
4,1.0,La Digue,Seychelles
...,...,...,...
102751,5.0,California,United States
102752,5.0,California,United States
102753,5.0,California,United States
102754,5.0,California,United States


Now, let's start making some different models.
First, we need to make a train and a test set.

In [95]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(cleanups)

## ZeroR model

This is the simplest model.
We find the most common category and always predict it.

In [96]:
value_counts = train['Day of week'].value_counts()
most_common = value_counts.idxmax()
accuracy = list(test['Day of week'] == most_common).count(True) / len(test)
accuracy

0.4266806804468839

As you can see, the model gets 43\% accuracy.
We will use this as a baseline for comparison when we make more models.

## OneR model

Now, we want to make a OneR model.
We have two columns, the state and country of the cleanup.
Here, we calculate the column which more accurately predicts the day of the week.

In [110]:
rules = {}
for column in ['State', 'Country']:
    rules[column] = {}
    data = train[[column, 'Day of week']].dropna()
    for value in data[column].unique():
        try:
            rules[column][value] = data[data[column] == value]['Day of week'].value_counts().idxmax()
        except ValueError:
            rules[column][value] = most_common

Now, we find the accuracy.

In [128]:
for column, rule in rules.items():
    data  = test[[column, 'Day of week']]
    pred  = [rule[value] for value in data[column] if value in rule]
    truth = [day for value, day in zip(data[column], data['Day of week']) if value in rule]
    accuracy = sum(int(p == t) for p, t in zip(pred, truth)) / len(pred)
    print(column, accuracy)

State 0.45719259376483623
Country 0.4417073776600149


As you can see, using the state of the cleanup gives us slightly better accuracy, 46\%.