Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

### Answers

_(See documentation below)_

My target will combine two survey columns about political action. Q301 asks _"Did you vote in the last parliamentary elections?"_ Q302 asks _"During the last parliamentary elections, did you attend any meetings or activities related to any electoral campaign?"_ The normalized value counts show that three combinations account for about 95% of the data:

- Yes, I voted, and yes, I attended meetings/activities.
- Yes, I voted, but no, I did not attend meetings/activities.
- No, I neither voted nor attended meetings/activities.

Together, these give a rough measure of political action.

This is a classification problem with three classes split roughly 40-40-20 (as shown below): the two largest classes are balanced, and the third is about half their size.

Accuracy will be my main metric, but I will also look at recall and precision.
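As a minimal sketch of how per-class precision and recall could sit alongside accuracy, the snippet below applies scikit-learn's metric functions to a handful of made-up labels. The short label names and the predictions are illustrative assumptions, not results from the survey data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels invented for illustration; they are NOT survey results.
y_true = ["voted", "voted", "neither", "neither", "both", "both"]
y_pred = ["voted", "neither", "neither", "neither", "both", "voted"]

# Fix the label order so the per-class arrays are easy to read.
labels = ["both", "neither", "voted"]

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one score per class, in the order given by `labels`.
precision = precision_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
recall = recall_score(y_true, y_pred, labels=labels, average=None, zero_division=0)

print(f"accuracy: {accuracy:.3f}")
for name, p, r in zip(labels, precision, recall):
    print(f"{name:>8}  precision={p:.3f}  recall={r:.3f}")
```

Per-class scores matter here because a model could reach reasonable accuracy while rarely predicting the smallest (roughly 20%) class.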

Although the observations have dates, this is not time-based data. All of it is supposed to come from one "period" (although 2012-14 is a fairly large period), so I will use a random train-val-test split.

I will exclude some of the features, probably many, since this dataset is so large. Below I begin my explorations.

In [1]:
%%capture
import sys

if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/bsmrvl/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd
pd.options.display.max_columns = 100
import numpy as np

from category_encoders import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

In [3]:
df = pd.read_csv(DATA_PATH + 'ABIII_English.csv', dtype=object, parse_dates=['date'])

In [4]:
df.head()

Unnamed: 0,qid,bid,country,date,wt,form,samp,a1,q1,q13,v13,sex,q101,q102,q102_insh,q102a,q103,q103_insh,q104,q104_insh,q105,q105_insh,q105a,q106,q2011,q2013,q2014,q2016,q2017,q20112,q20113,q20114,q20115,q20116,q20117,q20118,q20119,q202,q202_insh,q2031,q2032,q2033,q2034,q2035,q2037,q2042,q2043,q2044,q20412,q20416,...,q800d7,q800d8,q800d9,q810a,q8111,q8112,q8113,q811a1,q811a2,q811a3,q812a1,q812a2,q812a3,q812a4,q812a5,q812a6,q812a7,q812a8,q817a,q818yem,q1001,q1002,q1003,q1003t,q1003yem,q1004,q1005,q1006,q1006a,q1007,q1007a,q1009,q1010,q1011,q1011a,q1011b,q1012,q1012a,q1013,q1014,q1015,q1016,q1017,q1019_1,q1019_2,q1020jo,q2001ir,q2003,q2004ir,q2005kw
0,1,1,Algeria,2013-03-31,0.8432402610778809,Form B,Main sample,South,Laghouat,Urban,,Male,Very good,Much better,Yes,Much worse,Most people are not trustworthy,No,"Yes, for economic and political reasons",No,Fully ensured,No,Same as last year,To a limited extent,I trust it to a medium extent,I absolutely do not trust it,I trust it to a medium extent,I trust it to a medium extent,I trust it to a great extent,I trust it to a medium extent,I trust it to a medium extent,,,,,,,No,No,Good,Neither good nor bad,Neither good nor bad,Neither good nor bad,,,Bad,Bad,Bad,Very bad,,...,,,,More of personal loss,Betterment of the economic situation,Increased social justice,Social and economic justice,Yes,Yes,No,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Very important,Very important,Very important,Very important,"A relationship of brotherhood, citizenship and...",,40,Male,Prepartory/Basic,,,Yes,,Full time (30 hours or more a week),Private,,Private sector employee,The job does not provide anything at the concl...,Married,Housewife,No,Yes,Muslim,,Owned,19000,38000,Our household income does not cover our expens...,We do not receive anything,Arabic,Amazigh,,,,,
1,2,1,Algeria,2013-03-31,0.6745921969413757,Form A,Main sample,South,Laghouat,Urban,,Female,Good,Somewhat better,Yes,Don't know,Most people are not trustworthy,No,Refuse,No,Not ensured,No,Worse than last year,To a limited extent,I absolutely do not trust it,I absolutely do not trust it,I trust it to a medium extent,I trust it to a great extent,I absolutely do not trust it,I absolutely do not trust it,I absolutely do not trust it,,,,,,,Yes,Yes,Bad,Bad,Bad,Neither good nor bad,,,Very bad,Very bad,Very bad,Very bad,,...,,,,More of personal loss,Betterment of the economic situation,Other,,No,Yes,,Very important,Very important,Very important,Very important,Not that important,Not that important,Very important,Very important,A relationship strained as a result of the cum...,,41,Female,Mid-level diploma (professional or technical,,,Yes,,Full time (30 hours or more a week),Public,A governmental employee,,Pension at the conclusion of service,Married,"Professional such as lawyer, accountant, teach...",Yes,Yes,Muslim,,Rented,Refuse,Refuse,Refuse,Refuse,Arabic,Amazigh,,,,,
2,3,1,Algeria,2013-03-31,0.8432402610778809,Form A,Main sample,South,Laghouat,Urban,,Male,Bad,Somewhat better,Yes,Similar,Most people are not trustworthy,No,"No, I do not think about emigrating",No,Absolutely not ensured,No,Worse than last year,Don't know,I absolutely do not trust it,I absolutely do not trust it,I trust it to a great extent,I trust it to a great extent,I absolutely do not trust it,I absolutely do not trust it,I absolutely do not trust it,,,,,,,Don't know,No,Neither good nor bad,Very bad,Very bad,Neither good nor bad,,,Very bad,Very bad,Very bad,Don't know,,...,,,,More of personal loss,Dignity,Other,,No,Yes,,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Not important at all,Not important at all,Very important,Very important,A relationship strained because of foreign con...,,44,Male,Elementary,,,Yes,,Full time (30 hours or more a week),Public,A governmental employee,,Pension at the conclusion of service,Bachelor,,Yes,Yes,Muslim,,Other,20000,30000,Our household income does not cover our expens...,We do not receive anything,Arabic,Amazigh,,,,,
3,4,1,Algeria,2013-03-31,1.011888265609741,Form B,Main sample,South,Laghouat,Urban,,Female,Good,Almost the same as the current situation,No,Similar,Most people are not trustworthy,No,"Yes, for economic and political reasons",No,Ensured,No,Same as last year,To a great extent,I trust it to a medium extent,I trust it to a great extent,I trust it to a great extent,I trust it to a great extent,I trust it to a great extent,I absolutely do not trust it,I trust it to a great extent,,,,,,,"Yes, definitely",Yes,Good,Good,Good,Very good,,,Good,Very good,Good,Good,,...,,,,More of personal victory,Fighting corruption,Social and economic justice,Other,Don't know,No,Yes,Somewhat important,Somewhat important,Somewhat important,Somewhat important,Very important,Very important,Very important,Very important,"A relationship of brotherhood, citizenship and...",,56,Female,Elementary,,,No,A housewife,,,,,,Married,"Professional such as lawyer, accountant, teach...",Yes,No,Muslim,,Owned,Don't know,57000,Our household income does not cover our expens...,We do not receive anything,Arabic,Amazigh,,,,,
4,5,1,Algeria,2013-03-31,1.011888265609741,Form B,Main sample,South,Laghouat,Urban,,Male,Good,Somewhat better,No,Similar,Most people are not trustworthy,No,"No, I do not think about emigrating",No,Fully ensured,No,Better than last year,To a great extent,I trust it to a limited extent,I trust it to a medium extent,I trust it to a medium extent,I trust it to a medium extent,I trust it to a limited extent,I trust it to a medium extent,I trust it to a medium extent,,,,,,,Yes,No,Good,Good,Good,Good,,,Good,Good,Good,Good,,...,,,,More of personal victory,Betterment of the economic situation,Increased social justice,Social and economic justice,No,No,No,Not that important,Not that important,Somewhat important,Not important at all,Not that important,Refuse,Somewhat important,Somewhat important,A relationship strained because of the mistake...,,48,Male,Elementary,,,Yes,,Full time (30 hours or more a week),Private,,Private sector employee,End of service gratuity,Married,Housewife,No,Yes,Muslim,,Owned,24000,34000,Our household income does not cover our expens...,We do not receive anything,Arabic,Does not speak second language,,,,,


In [5]:
## Begin looking at combined target.

(df['q301'] + df['q302']).value_counts(normalize=True)

YesNo                   0.383213
NoNo                    0.375312
YesYes                  0.190155
NoYes                   0.026065
NoDon't know            0.011277
YesDon't know           0.003511
Don't knowDon't know    0.002566
RefuseRefuse            0.001688
Don't knowNo            0.001486
NoRefuse                0.001418
RefuseNo                0.001283
YesRefuse               0.000878
Don't knowYes           0.000473
RefuseDon't know        0.000270
RefuseYes               0.000203
Don't knowRefuse        0.000203
dtype: float64

In [6]:
## Create combined target. Remove the ~5% of data that doesn't fall into the top
## three classes.

df['target'] = (df['q301'] + df['q302']).replace({'YesYes':'Voted, attended campaign activities',
                                                  'YesNo':'Voted, did not attend campaign activities',
                                                  'NoNo':'Neither voted nor attended activities'})
target_options = ['Voted, attended campaign activities',
                  'Voted, did not attend campaign activities',
                  'Neither voted nor attended activities']

df = df[df['target'].isin(target_options)]

In [7]:
df['target'].value_counts(normalize=True)

Voted, did not attend campaign activities    0.403943
Neither voted nor attended activities        0.395615
Voted, attended campaign activities          0.200441
Name: target, dtype: float64

In [8]:
df = df.drop(columns=['q301', 'q302'])

In [9]:
df.shape

(14049, 295)

In [10]:
## Check if id columns are unique identifiers. Nope!

(df['qid'] + df['bid']).value_counts()

81            9
162           9
202           9
91            9
192           9
             ..
7727090157    1
95364         1
49934         1
773130379     1
7702704178    1
Length: 3830, dtype: int64
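One caveat with the check above: assuming `qid` and `bid` were read as strings (which `dtype=object` implies), adding them concatenates the text, so different pairs can collapse into the same key. A small sketch with made-up ids shows the ambiguity and an alternative that compares the pair of columns directly; the toy frame `ids` is hypothetical.

```python
import pandas as pd

# Made-up ids for illustration: ("1", "11") and ("11", "1") are different pairs.
ids = pd.DataFrame({
    "qid": ["1", "11", "1", "2"],
    "bid": ["11", "1", "11", "3"],
})

# String concatenation maps both pairs to the same key "111".
naive_key = ids["qid"] + ids["bid"]
naive_unique = naive_key.nunique()

# Comparing the two columns as a pair avoids that collision.
pair_dupes = ids.duplicated(subset=["qid", "bid"], keep=False)
n_pairs = len(ids.drop_duplicates(subset=["qid", "bid"]))

print("unique concatenated keys:", naive_unique)
print("unique (qid, bid) pairs: ", n_pairs)
print(pair_dupes.tolist())
```

On the real frame, `df.duplicated(subset=['qid', 'bid']).any()` would answer the same question without the ambiguity.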

In [11]:
## No duplicate rows.

df.duplicated().sum()

0

In [12]:
df = df.drop(columns=['qid','bid'])

In [13]:
## Check for duplicate columns

df.loc[:, df.T.duplicated()]

Unnamed: 0,q1002
0,Male
1,Female
2,Male
3,Female
4,Male
...,...
14804,Male
14805,Female
14806,Male
14807,Female


In [14]:
df = df.drop(columns=['q1002','wt','form','samp'])

In [15]:
## Explore, clean, and create a quick test model. At the moment I'm only
## looking at religious/cultural questions.

df_opinions = pd.concat([
    df.loc[:,'q6012':'q618']
], axis=1)
df_opinions.head()

Unnamed: 0,q6012,q6013,q6014,q603,q6041,q6043,q6045,q6051,q6052,q6053,q6054,q6055,q6056,q605a,q605b1,q605b2,q6061,q6062,q6063,q6064,q6065,q6066,q6071,q6072,q6073,q6074,q6076,q6082,q6087,q608a,q608b,q609,q6101,q6105,q6106,q615,q616,q617,q618
0,I strongly agree,I strongly agree,I somewhat disagree,Yes,Constitutes an obstacle to a medium extent,Constitutes an obstacle to a limited extent,Does not constitute an obstacle whatsoever,I strongly agree,I somewhat agree,I somewhat disagree,I strongly agree,I somewhat agree,I strongly agree,I agree with the 2nd sentence,I strongly support,I somewhat support,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat agree,I strongly agree,I somewhat disagree,I somewhat disagree,I strongly agree,I strongly agree,I strongly agree,I strongly agree,I strongly agree,,Laws regulating marriage and divorce shall be ...,Somewhat religious,Always,Always,Most of the time,They do not match closely,Sharia is the human interpretation of the word...,Not much,A little
1,I somewhat disagree,I strongly agree,I somewhat disagree,Don't know,Constitutes an obstacle to a limited extent,Constitutes an obstacle to a limited extent,Constitutes an obstacle to a limited extent,I strongly agree,I strongly agree,I strongly agree,I strongly agree,I strongly agree,I strongly agree,I Strongly agree with the 1st sentence,I strongly support,I strongly support,Don't know,Don't know,Don't know,Don't know,Don't know,Don't know,Don't know,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat agree,Don't know,Don't know,Laws regulating marriage and divorce shall be ...,,Somewhat religious,Always,Most of the time,Sometimes,They do not match at all,Sharia is the human interpretation of the word...,Not much,Not much
2,I somewhat agree,I somewhat agree,I strongly disagree,"No, I would not participate on principle",Constitutes an obstacle to a great extent,Constitutes an obstacle to a limited extent,Constitutes an obstacle to a limited extent,I somewhat disagree,I somewhat agree,I somewhat agree,I somewhat disagree,I somewhat agree,I somewhat agree,I do not agree with either sentence (Do not read),I strongly support,Don't know,I somewhat agree,I somewhat disagree,I somewhat disagree,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat disagree,I somewhat disagree,I somewhat disagree,I somewhat disagree,I somewhat disagree,Don't know,Don't know,Laws regulating marriage and divorce shall be ...,,Refuse,Refuse,Refuse,Refuse,They do not match closely,Sharia is the human interpretation of the word...,Refuse,Not much
3,I strongly agree,I somewhat agree,I strongly disagree,"Yes, definitely",Constitutes an obstacle to a medium extent,Constitutes an obstacle to a medium extent,Constitutes an obstacle to a medium extent,I strongly agree,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat agree,I strongly agree,I agree with the 1st sentence,I strongly support,Don't know,I somewhat agree,Don't know,I strongly agree,I strongly agree,I strongly agree,I strongly agree,Don't know,I somewhat agree,I strongly agree,I strongly agree,I somewhat disagree,I strongly agree,I strongly agree,,Laws regulating marriage and divorce shall be ...,Somewhat religious,Always,Rarely,Sometimes,They do not match closely,Sharia is the human interpretation of the word...,Not much,A lot
4,I somewhat disagree,I strongly agree,I somewhat agree,"No, because I would not win and I do not want ...",Constitutes an obstacle to a medium extent,Constitutes an obstacle to a medium extent,Does not constitute an obstacle whatsoever,Don't know,I strongly agree,I somewhat agree,I somewhat agree,I somewhat agree,I somewhat agree,I agree with the 2nd sentence,I somewhat support,I do not support,I somewhat agree,I strongly disagree,I strongly disagree,I strongly agree,I somewhat agree,I somewhat agree,I somewhat disagree,I somewhat disagree,I somewhat agree,I somewhat agree,I strongly agree,I somewhat agree,I somewhat agree,,Laws regulating marriage and divorce shall con...,Religious,Always,Always,Most of the time,They match to some extent,Sharia is the human interpretation of the word...,A lot,A little


In [16]:
nulls = df_opinions.isnull().sum()
nulls[nulls>0]

q608a    7036
q608b    7013
q617      551
q618      549
dtype: int64

In [17]:
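## q608a and q608b are each null in about half the rows. Fill the nulls with
## empty strings and concatenate to get one combined column.
df_opinions['q608'] = df_opinions['q608a'].fillna('') + df_opinions['q608b'].fillna('')
df_opinions = df_opinions.drop(columns=['q608a','q608b'])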
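The null counts for `q608a` and `q608b` (7036 and 7013) add up to the 14049 rows, which suggests that each row answers exactly one of the two. Under that assumption, the merge can be sketched on toy data in two equivalent ways: by concatenating after filling with empty strings, or with `Series.combine_first`. The series `a` and `b` below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for q608a and q608b: exactly one is non-null in each row.
a = pd.Series(["x", np.nan, "z"])
b = pd.Series([np.nan, "y", np.nan])

# Approach 1: fill nulls with empty strings, then concatenate.
merged_concat = a.fillna("") + b.fillna("")

# Approach 2: take a's value where present, otherwise fall back to b.
merged_first = a.combine_first(b)

print(merged_concat.tolist())
print(merged_first.tolist())
```

The two approaches differ when both columns are filled (concatenation glues the answers together) or both are null (concatenation yields an empty string, `combine_first` keeps the null), so the choice depends on which behavior is wanted for such rows.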

In [18]:
nulls = df_opinions.isnull().sum()
nulls[nulls>0]

q617    551
q618    549
dtype: int64

In [19]:
## Treat missing answers as their own category.
df_opinions['q617'] = df_opinions['q617'].fillna('Missing')
df_opinions['q618'] = df_opinions['q618'].fillna('Missing')

In [20]:
nulls = df_opinions.isnull().sum()
nulls[nulls>0]

Series([], dtype: int64)

In [21]:
X = df_opinions
y = df['target']

In [22]:
X_train, X_t, y_train, y_t = train_test_split(X, y, test_size=.3, random_state=42)

In [23]:
X_val, X_test, y_val, y_test = train_test_split(X_t, y_t, test_size=.5, random_state=42)
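The two-step split above yields roughly 70% train, 15% validation, and 15% test. A variant worth considering, sketched here on toy data, passes `stratify` so each subset keeps the same class mix. The names `X_demo` and `y_demo` and the short class labels are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with the same rough 40/40/20 class mix as the real one.
y_demo = pd.Series(["voted"] * 40 + ["neither"] * 40 + ["both"] * 20)
X_demo = pd.DataFrame({"feature": range(len(y_demo))})

# Step 1: hold out 30% of the rows, preserving class proportions.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Step 2: split the held-out rows in half for validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
print(y_tr.value_counts(normalize=True).round(2).to_dict())
```

With about 14,000 rows a plain random split should already land close to these proportions, so stratifying is a small safeguard rather than a necessity.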

In [24]:
model = make_pipeline(
    OrdinalEncoder(),
    DecisionTreeClassifier(random_state=42, max_depth=10)
)
model.fit(X_train, y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['q6012', 'q6013', 'q6014', 'q603',
                                      'q6041', 'q6043', 'q6045', 'q6051',
                                      'q6052', 'q6053', 'q6054', 'q6055',
                                      'q6056', 'q605a', 'q605b1', 'q605b2',
                                      'q6061', 'q6062', 'q6063', 'q6064',
                                      'q6065', 'q6066', 'q6071', 'q6072',
                                      'q6073', 'q6074', 'q6076', 'q6082',
                                      'q6087', 'q609', ...],
                                mapping=[{'col': 'q6012',
                                          'data_type': dtype('O'),
                                          'ma...
                                          'data_type': dtype('O'),
                                          'mapping': I strongly agree       1
I somewhat agree       2
I strongly disagree    3
I somewhat disag

In [25]:
print("Train:",model.score(X_train, y_train))
print("Val:",model.score(X_val, y_val))

Train: 0.602196461256864
Val: 0.4219269102990033
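To put these scores in context, a quick sketch compares them with the majority-class baseline. The class shares are copied from the `value_counts` output earlier in this notebook; they describe the full filtered dataset, so using them as the training distribution is an approximation.

```python
# Class shares copied from the target's normalized value counts above.
# Assumption: the train split has roughly the same distribution.
class_shares = {
    "Voted, did not attend campaign activities": 0.403943,
    "Neither voted nor attended activities": 0.395615,
    "Voted, attended campaign activities": 0.200441,
}

# Always predicting the most common class scores that class's share.
majority_class = max(class_shares, key=class_shares.get)
baseline_accuracy = class_shares[majority_class]

# Scores reported by the decision tree above.
train_accuracy = 0.602196461256864
val_accuracy = 0.4219269102990033

lift_over_baseline = val_accuracy - baseline_accuracy
overfit_gap = train_accuracy - val_accuracy

print(f"Baseline accuracy:  {baseline_accuracy:.3f}")
print(f"Lift over baseline: {lift_over_baseline:.3f}")
print(f"Train minus val:    {overfit_gap:.3f}")
```

The validation score is less than two points above the baseline, while the train score is about 18 points above validation, which points to overfitting at `max_depth=10`. On the real data, the baseline would more properly come from `y_train.value_counts(normalize=True).max()`.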


In [None]:
## Pretty terrible so far!