# Module 2: Handling data and deployment
Contents
- [1. Setup](#9d2860101d08aff4362723c550f1c5f32aedac5831ca420b)
- [2. Getting the data](#9d2860101d08aff4362723c550f1c5f32aedac5831ca420b)
## 1. Setup
- Install Python using [Pyenv](https://realpython.com/intro-to-pyenv/)
    - Install Python ```$ pyenv install 3.7.9```
    - Set the local version of Python to the installed version ```$ pyenv local 3.7.9```
    - Run the initialiser ```$ pyenv init```
    - Follow the instructions printed in the CLI
    - Activate a shell with the local Python ```$ pyenv shell 3.7.9```
- To set up a virtual environment, I am using [poetry](https://python-poetry.org/docs/)
    - Install poetry following these [commands](https://python-poetry.org/docs/#installation)
    - `Poetry` can help manage [multiple environments](https://python-poetry.org/docs/managing-environments/), in particular [switching between environments](https://python-poetry.org/docs/managing-environments/#switching-between-environments)
    - After setting up a poetry project, use the following command to create the `venv` with your `pyenv`-installed Python: `$ poetry env use ~/.pyenv/versions/<python version number>/bin/python`
    - Check which environment is activated using `$ poetry env list`
    - Start a shell inside the environment using `$ poetry shell`
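
To confirm from inside Python that the environment really uses the interpreter installed with `pyenv`, a quick check along the following lines can be run once the shell is active. This is only a sketch: the helper name `is_expected_python` is hypothetical, and the expected version `(3, 7)` simply mirrors the version installed above.

```python
import sys


def is_expected_python(expected=(3, 7), version_info=None):
    """Return True if the major/minor version matches `expected`.

    Illustrative helper only; it is not part of pyenv or poetry.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


# Path of the interpreter that is currently running (e.g. inside the virtualenv).
print(sys.executable)
# Whether that interpreter is a Python 3.7 release.
print(is_expected_python())
```

If the printed path does not point into the poetry virtual environment, the environment is probably not the active one.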

## 2. Getting the data
Download the data from the [UK Data Service](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/access-data).
The data is taken from the European Quality of Life Time Series, 2007 and 2011. You can find out more about the dataset [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7724#!/details),
and there is a good [user guide](http://doc.ukdataservice.ac.uk/doc/7724/mrdoc/pdf/7724_eqls_2007-2011_user_guide_v2.pdf)
describing how the data was created and anonymised and how some variables were derived to summarise other variables.
Save the data in a local folder (for example: `$HOME/rds-course/data-analysis/data`).
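
Because the rest of this section assumes the download has been unpacked in a particular place, it can help to fail early with a clear message when the file is missing. The sketch below shows one way to do that; `find_data_file` is a hypothetical helper, and the relative path in the usage comment is the one used by the loading code in this section.

```python
from pathlib import Path


def find_data_file(root, relative_path):
    """Return the absolute path of a data file, or raise a descriptive error.

    Hypothetical helper for illustration; adjust `root` to your own folder.
    """
    path = (Path(root) / relative_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file not found: {path}")
    return path


# Example usage (assumes the archive was unpacked under ./data):
# eqls_path = find_data_file("data", "UKDA-7724-csv/csv/eqls_2007and2011.csv")
```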

In [159]:
# import required packages
from pathlib import Path
import pandas as pd


%matplotlib inline

In [160]:
root_data_path = Path("data")
eqls_path = (root_data_path / "UKDA-7724-csv" / "csv" / "eqls_2007and2011.csv").resolve()
eqls_df = pd.read_csv(eqls_path)

eqls_df.head()

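| | Wave | Y11_Country | Y11_Q31 | Y11_Q32 | Y11_ISCEDsimple | Y11_Q49 | Y11_Q67_1 | Y11_Q67_2 | Y11_Q67_3 | Y11_Q67_4 | ... | DV_Q54a | DV_Q54b | DV_Q55 | DV_Q56 | DV_Q8 | DV_Q10 | ISO3166_Country | RowID | URIRowID | UniqueID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 1 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000083 |
| 1 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 2 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000126 |
| 2 | 2 | 1 | 1.0 | 2.0 | 3.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 3 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000267 |
| 3 | 2 | 1 | 2.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 4 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000268 |
| 4 | 2 | 1 | 4.0 | 0.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 5 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000427 |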


### Additional info about the data
The `.info()` method of a dataframe gives us a useful summary of the columns it contains, and `.describe()` reports summary statistics for each numeric column:

In [161]:
eqls_df.info()
eqls_df.dtypes
eqls_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79270 entries, 0 to 79269
Columns: 199 entries, Wave to UniqueID
dtypes: float64(187), int64(9), object(3)
memory usage: 120.4+ MB


| | Wave | Y11_Country | Y11_Q31 | Y11_Q32 | Y11_ISCEDsimple | Y11_Q49 | Y11_Q67_1 | Y11_Q67_2 | Y11_Q67_3 | Y11_Q67_4 | ... | DV_Q7 | DV_Q67 | DV_Q43Q44 | DV_Q54a | DV_Q54b | DV_Q55 | DV_Q56 | DV_Q8 | DV_Q10 | RowID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 79270.0 | 79270.0 | 78756.0 | 78769.0 | 78556.0 | 79082.0 | 43636.0 | 43636.0 | 43636.0 | 43636.0 | ... | 2225.0 | 43636.0 | 78312.0 | 43636.0 | 43636.0 | 43636.0 | 43636.0 | 43636.0 | 43636.0 | 79270.0 |
| mean | 2.550473 | 16.841138 | 1.856049 | 1.598141 | 4.019146 | 2.640955 | 1.959368 | 1.023673 | 1.019204 | 1.001971 | ... | 52.612135 | 1.086465 | 2.485992 | 2.815565 | 2.925635 | 0.303442 | 0.231437 | 3.931708 | 3.283482 | 39635.5 |
| std | 0.497449 | 9.35832 | 1.186271 | 1.276425 | 1.368993 | 0.987352 | 0.197437 | 0.15203 | 0.137244 | 0.044351 | ... | 15.696943 | 0.460388 | 0.838558 | 0.721642 | 0.568403 | 0.881979 | 0.827727 | 0.436254 | 1.130667 | 22883.422256 |
| min | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 25% | 2.0 | 9.0 | 1.0 | 0.0 | 3.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 43.0 | 1.0 | 2.0 | 3.0 | 3.0 | 0.0 | 0.0 | 4.0 | 2.0 | 19818.25 |
| 50% | 3.0 | 16.0 | 1.0 | 2.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 50.0 | 1.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 4.0 | 4.0 | 39635.5 |
| 75% | 3.0 | 25.0 | 3.0 | 2.0 | 5.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 | ... | 61.0 | 1.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 4.0 | 4.0 | 59452.75 |
| max | 3.0 | 35.0 | 4.0 | 5.0 | 8.0 | 4.0 | 2.0 | 2.0 | 2.0 | 2.0 | ... | 80.0 | 6.0 | 3.0 | 6.0 | 6.0 | 4.0 | 4.0 | 4.0 | 4.0 | 79270.0 |


In [162]:
eqls_df["Y11_Country"].unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 34, 31, 33,
       32])

We can see there are 79270 rows (each row corresponding to one survey respondent) and 199 columns (each column corresponding to a variable).
The line `dtypes: float64(187), int64(9), object(3)` in the output of `eqls_df.info()` shows that most of the columns (187)
contain floating-point numbers, 9 columns contain integers, and 3 are object columns (here, these are string columns).
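
The `count` row of `describe()` differs between columns (for example 79270 for `Wave` but 43636 for `Y11_Q67_1`), which indicates missing values. The sketch below shows how the dtype summary and the number of missing values per column could be obtained directly. It uses a small made-up dataframe (`toy_df` and its column names are assumptions for illustration) rather than the real `eqls_df`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for eqls_df; the column names are made up.
toy_df = pd.DataFrame(
    {
        "wave": [2, 2, 3],
        "score": [4.0, np.nan, 1.0],
        "respondent_id": ["AT1", "AT2", "AT3"],
    }
)

# Number of columns of each dtype (what the `dtypes:` line of .info() summarises).
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Number of missing values in each column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

The same two calls should work unchanged on `eqls_df`, where sorting `missing_counts` would show which variables are only available for one wave.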

## 3. Understanding the data
Let's read in the metadata explaining the content of each column.

In [166]:
eqls_api_map_path = (root_data_path / "UKDA-7724-csv" / "mrdoc" / "excel" / "eqls_api_map.csv").resolve()
eqls_api_map_df = pd.read_csv(eqls_api_map_path, encoding='latin1')
eqls_api_map_df.head()


| | VariableName | VariableLabel | Question | TopicValue | KeywordValue | VariableGroupValue |
|---|---|---|---|---|---|---|
| 0 | Wave | EQLS Wave | EQLS Wave | NaN | NaN | Administrative Variables |
| 1 | Y11_Country | Country | Country | Geographies | NaN | Household Grid and Country |
| 2 | Y11_Q31 | Marital status | Marital status | Social stratification and groupings - Family l... | Marital status | Family and Social Life |
| 3 | Y11_Q32 | No. of children | Number of children of your own | Social stratification and groupings - Family l... | Children | Family and Social Life |
| 4 | Y11_ISCEDsimple | Education completed | Highest level of education completed | Education - Higher and further | Education levels | Education |


## Renaming our variables to more intuitive names
As we can see from `eqls_api_map_df`, the `VariableName` is often not an intuitive description of the variable (e.g.
`Y11_ISCEDsimple` is not a good description of "Education completed"). If we want a more intuitive naming convention for our features,
we can build one from the `VariableLabel` column.

In [164]:
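def create_intuitive_feature_name(data_df, mapping_df):
    # The mapping is positional, so both inputs must describe the same number of columns.
    if len(data_df.columns) != len(mapping_df):
        raise ValueError("data_df and mapping_df describe a different number of columns")
    old_column_list = list(data_df.columns.values)
    # Replace whitespace with underscores, then drop any remaining non-word characters.
    new_col_list = (
        mapping_df['VariableLabel']
        .str.replace(r'\s', '_', regex=True)
        .str.replace(r'[^\w]', '', regex=True)
        .tolist()
    )
    column_dict = dict(zip(old_column_list, new_col_list))
    return data_df.rename(columns=column_dict)

eqls_df = create_intuitive_feature_name(eqls_df, eqls_api_map_df)
eqls_df.head()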



| | EQLS_Wave | Country | Marital_status | No_of_children | Education_completed | Ruralurban_living | Citizenship__Country | Citizenship__Another_EU_member | Citizenship__A_nonEU_country | Citizenship__Dont_know | ... | DV_Anyone_usedwould_have_like_to_use_child_care_last_12_months | DV_Anyone_usedwould_have_like_to_use_long_term_care_last_12_months | DV_No_of_factors_which_made_it_difficult_to_use_child_care | DV_No_of_factors_which_made_it_difficult_to_use_long_term_care | DV_Preferred_working_hours_3_groups | DV_Preferred_working_hours_of_respondents_partner_3_groups | ISO3166_Country_URL | RowID_for_the_UK_Data_service_Public_API | Root_URI_for_a_row_respondent_that_displays_all_data_values_for_a_single_row_via_the_UK_Data_Service_Public_API | Unique_respondent_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 1 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000083 |
| 1 | 2 | 1 | 4.0 | 0.0 | 4.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 2 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000126 |
| 2 | 2 | 1 | 1.0 | 2.0 | 3.0 | 2.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 3 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000267 |
| 3 | 2 | 1 | 2.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 4 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000268 |
| 4 | 2 | 1 | 4.0 | 0.0 | 3.0 | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | https://www.iso.org/obp/ui/#iso:code:3166:AT | 5 | https://api.ukdataservice.ac.uk/V1/datasets/eq... | AT9000427 |
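
The renaming above pairs old and new names purely by position, so it relies on the metadata rows being in the same order as the dataframe columns. As an alternative sketch, the lookup could be keyed on `VariableName` instead, which does not depend on ordering. The helper `rename_by_variable_name` and the toy dataframes below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy stand-ins for eqls_df and eqls_api_map_df; note the different ordering.
toy_data = pd.DataFrame({"Y11_Q32": [0.0, 2.0], "Wave": [2, 3]})
toy_map = pd.DataFrame(
    {
        "VariableName": ["Wave", "Y11_Q32"],
        "VariableLabel": ["EQLS Wave", "No. of children"],
    }
)


def rename_by_variable_name(data_df, mapping_df):
    """Rename columns by looking each one up in VariableName (illustrative helper)."""
    labels = (
        mapping_df.set_index("VariableName")["VariableLabel"]
        .str.replace(r"\s", "_", regex=True)
        .str.replace(r"[^\w]", "", regex=True)
    )
    # Columns without an entry in the mapping keep their original name.
    return data_df.rename(columns=labels.to_dict())


renamed = rename_by_variable_name(toy_data, toy_map)
print(list(renamed.columns))  # ['No_of_children', 'EQLS_Wave']
```

One trade-off of this variant is that a column missing from the metadata is silently left unchanged, whereas the positional version refuses to run when the lengths differ.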


<div class="alert alert-block alert-warning"> The metadata csv <code class="ph codeph">eqls_api_map.csv</code> contains
non-ASCII characters that cannot be decoded with the default UTF-8 encoding. Read it in with the additional argument
<code class="ph codeph">encoding='latin1'</code> </div>
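
To see why the `encoding` argument matters, the following sketch builds a tiny made-up CSV in memory, stored as Latin-1 bytes, and shows that the same bytes cannot be decoded as UTF-8. The sample content is invented, and it is an assumption that the real metadata file is Latin-1 encoded, as the argument above suggests.

```python
import io

import pandas as pd

# A tiny made-up CSV containing a non-ASCII character, stored as Latin-1 bytes.
raw_bytes = "VariableName,VariableLabel\nQ1,Café visits\n".encode("latin1")

# Decoding the same bytes as UTF-8 fails, because the byte 0xE9 is not valid UTF-8 here.
try:
    raw_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Telling pandas the actual encoding makes the content readable.
toy_map_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(decoded_as_utf8, toy_map_df.loc[0, "VariableLabel"])
```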

In [165]:
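# save derived data locally
derived_data_path = root_data_path / "derived_data"
derived_data_path.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist yet
eqls_df.to_csv(derived_data_path / "eqls_2011_cleaned.csv", index=False)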
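
Passing `index=False` keeps the row index out of the file; without it, the index is written as an extra unnamed column that reappears as `Unnamed: 0` when the file is read back. A quick round-trip check, sketched here on a made-up dataframe in a temporary folder (the file and variable names are assumptions), can confirm that the saved file reloads with the expected columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data standing in for the cleaned eqls_df.
toy_df = pd.DataFrame({"EQLS_Wave": [2, 3], "Country": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "derived_data" / "toy_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False: do not write the row index as an extra column.
    toy_df.to_csv(out_path, index=False)
    reloaded_df = pd.read_csv(out_path)

# The reloaded frame has the same columns and shape as the original.
print(list(reloaded_df.columns), reloaded_df.shape)
```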
