In [1]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt

In [7]:
# read_excel takes no `encoding` argument in current pandas; Excel files carry their own encoding.
data09 = pd.read_excel('DDS9_Data_Extract_with_labels.xlsx')
data10 = pd.read_excel('DDS10_Data_Extract_with_labels.xlsx')
data11 = pd.read_excel('DDS11_Data_Extract_with_labels.xlsx')

In [8]:
print(data09.shape)
print(data10.shape)
print(data11.shape)

(2076, 191)
(2205, 197)
(2131, 198)


Printing the shape of each year's data gives two important takeaways:
1. Each year has a similar number of observations (between 2,076 and 2,205). This is good, as it means no single year will have undue influence on the results.
2. The number of features differs from year to year. If a feature exists in only one or two of the years, at least a third of the combined observations would have no data for it, which is a problem.

To address the second point, the first step is to identify which features are not shared between the years.

In [56]:
cols9, cols10, cols11 = list(data09.columns), list(data10.columns), list(data11.columns)
cols910 = [x for x in cols9 if x in cols10]
shared_cols = [x for x in cols910 if x in cols11]
len(shared_cols)

109

In [36]:
not_cols910 = [x for x in cols9 if x not in cols10]
not_cols109 = [x for x in cols10 if x not in cols9]
not_cols_comb910 = not_cols910 + not_cols109

not_cols_91011 = [x for x in not_cols_comb910 if x not in cols11]
not_cols_11910 = [x for x in cols11 if x not in not_cols_comb910]

not_cols = not_cols_91011 + not_cols_11910
len(not_cols)

204

In [45]:
cols1011 = [x for x in cols10 if x in cols11]
len(cols1011)

182

Here we can see that 109 columns are shared by all three years, while 204 columns do not share the same name. However, because years 10 and 11 are so similar in size, we suspected that they would overlap more with each other. They do: 182 of their columns have exactly the same name. We therefore decided to merge these two years and set aside the data from year 9.
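
The same overlap questions can be answered with set operations, which makes the split between "shared by every year" and "missing from at least one year" explicit. The following is a minimal sketch on toy frames. `column_overlap` is a hypothetical helper, not part of the original notebook, and because it counts each distinct column name once, its totals will not necessarily match the 204 reported above.

```python
import pandas as pd

def column_overlap(*frames):
    """Return (shared, not_shared) sets of column names across all frames."""
    col_sets = [set(f.columns) for f in frames]
    shared = set.intersection(*col_sets)          # present in every frame
    not_shared = set.union(*col_sets) - shared    # missing from at least one frame
    return shared, not_shared

# Toy stand-ins for the three survey years (illustrative column names only).
a = pd.DataFrame(columns=["age", "region", "Q1"])
b = pd.DataFrame(columns=["age", "region", "Q2"])
c = pd.DataFrame(columns=["age", "Q2", "Q3"])

shared, not_shared = column_overlap(a, b, c)
print(sorted(shared))
print(sorted(not_shared))
```

Applied to the real data, this would be called as `column_overlap(data09, data10, data11)`, or with only two frames to compare a single pair of years.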

Before merging the sets, we had to give the target variable the same name in both years so that it would become a shared column. This brings the total to 183 shared columns.

In [52]:
# Copy each year's own target column (third from last) to a common name.
data10['target_willing'] = data10.iloc[:, -3]
data11['target_willing'] = data11.iloc[:, -3]
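
Positional selection with `iloc[:, -3]` depends on column order, and the position shifts as soon as a new column is appended. A name-based lookup is one way to make this step less fragile. The sketch below assumes the target question can be identified by a substring of its label. `find_column`, the toy frames, and the keyword `"willing to pay"` are all hypothetical and do not come from the survey files.

```python
import pandas as pd

def find_column(frame, keyword):
    """Return the single column whose label contains `keyword` (case-insensitive)."""
    matches = [c for c in frame.columns if keyword.lower() in str(c).lower()]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {keyword!r}, found {len(matches)}"
        )
    return matches[0]

# Toy frames: the same question appears under a different label in each year.
toy10 = pd.DataFrame({"Q1 - age": [31, 30],
                      "Q26 - willing to pay (year-10 wording)": ["Agree", "Disagree"]})
toy11 = pd.DataFrame({"Q1 - age": [45, 52],
                      "Q29 - willing to pay (year-11 wording)": ["Disagree", "Agree"]})

# Each frame gets its own values under the common name.
for frame in (toy10, toy11):
    frame["target_willing"] = frame[find_column(frame, "willing to pay")]
```

Raising an error when zero or several labels match keeps a wrong column from being picked up silently.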

The final step before merging the datasets is to add a column indicating the survey year, in case it proves relevant to the analysis.

In [53]:
data10['year'] = 10
data11['year'] = 11

In [58]:
merge_cols = [x for x in cols10 if x in cols11]

merge10 = data10[merge_cols]
merge11 = data11[merge_cols]

df = pd.concat([merge10, merge11])

In [60]:
df.head()

Unnamed: 0,record - Record number,"Q1r1 - To begin, what is your age?",Q4 - What is your gender?,age - you are...,Q2 - In which state do you currently reside?,region - Region,QNEW3 - What is your employment status?,Q5 - Which category best describes your ethnicity?,QNEW1 - Do you have children living in your home (excluding yourself if you are under 18)?,QNEW2 - How old are the children in your home?-0-4 years,...,"Q39rNEW1 - I would rather pay for sports information online in exchange for not being exposed to advertisements. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, c","Q39rNEW2 - I would rather pay for games online in exchange for not being exposed to advertisements. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, choose ""N/A.""","Q39rNEW3 - I would rather pay for music online in exchange for not being exposed to advertisements. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, choose ""N/A.""","Q39rNEW4 - I would rather pay for TV shows online in exchange for not being exposed to advertisements. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, choose ""N/A","Q39rNEW5 - I would rather pay for movies online in exchange for not being exposed to advertisements. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, choose ""N/A.""","Q39r2 - I would be willing to provide more personal information online if that meant I could receive advertising more targeted to my needs and interests. 
- Using the scale below, please indicate how much you agree or disagree with the following statements.","Q39r3 - By providing more personal information online, I am worried about becoming a victim of identity theft. - Using the scale below, please indicate how much you agree or disagree with the following statements. If the question does not apply to you, cho",Q89 - Which of the following is your most frequently used mechanism to get news?,target_willing,year
0,7,31,Female,30-46,Illinois,Midwest,Unemployed,White or Caucasian (Non-Hispanic),Yes,No,...,Agree somewhat,Agree strongly,Agree strongly,Agree somewhat,Agree strongly,Agree somewhat,Agree somewhat,Social media sites,Agree somewhat,10
1,4,30,Female,30-46,Arkansas,South,Unemployed,White or Caucasian (Non-Hispanic),Yes,Yes,...,Disagree strongly,Disagree strongly,Disagree strongly,Disagree somewhat,Disagree strongly,Disagree somewhat,Agree somewhat,Social media sites,Agree strongly,10
2,8,61,Male,47-65,Alabama,South,Retired,White or Caucasian (Non-Hispanic),No,,...,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Agree strongly,Television news stations,Agree strongly,10
3,3,68,Female,66 or older,New York,Northeast,Retired,White or Caucasian (Non-Hispanic),No,,...,N/A; I do not have a basis to answer,N/A; I do not have a basis to answer,N/A; I do not have a basis to answer,Disagree strongly,Disagree somewhat,Disagree strongly,Agree strongly,Television news stations,Agree somewhat,10
4,15,50,Female,47-65,Iowa,Midwest,Employed full-time or part-time,White or Caucasian (Non-Hispanic),No,,...,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Disagree strongly,Agree somewhat,Television news stations,Agree somewhat,10


In [59]:
df.shape

(4336, 184)
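
This shape is consistent with the inputs: 2205 + 2131 = 4336 rows, and the 182 shared survey columns plus `target_willing` and `year` give 184 columns. A check of this kind can be written down explicitly, as in the following sketch on toy frames. `ignore_index=True` is an optional choice that the original notebook does not use. It gives the combined frame a unique row index instead of repeating each year's own labels.

```python
import pandas as pd

# Toy stand-ins for the two yearly frames, already restricted to shared columns.
toy10 = pd.DataFrame({"age": [31, 30, 61], "year": 10})
toy11 = pd.DataFrame({"age": [45, 52], "year": 11})

combined = pd.concat([toy10, toy11], ignore_index=True)

# Rows should add up, the column count should be unchanged, and the index unique.
assert combined.shape == (len(toy10) + len(toy11), toy10.shape[1])
assert combined.index.is_unique

# Per-year row counts confirm that both years made it into the result.
print(combined["year"].value_counts().to_dict())
```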