In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
wvs = pd.read_csv('../raw_data/WVS_TimeSeries_4_0.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../raw_data/WVS_TimeSeries_4_0.csv'

## Problem Statement ##

Authoritarian backsliding has been one of the most concerning trends in the democratic world over the last several years. To gain and keep power, authoritarian leaders need followers who support them and what they stand for, at least at first.

In [None]:
wvs.head()

In [None]:
wvs.shape

In [None]:
wvs.COUNTRY_ALPHA.value_counts()

In [None]:
wvs.columns

There are a number of pre-engineered aggregate variables at the end of the dataset, each with the prefix 'Y'. I'm not inclined to use them in my models because the literature describing which questions they aggregate, and how, isn't great. They also seem to have a fair number of null values. As I searched through these, I was somewhat interested in variable **Y011A**, which is described in the data dictionary as "AUTHORITY". After some further digging, however, I discovered that this statistic is meant to measure the respondent's defiance of authority, rather than inclination towards authoritarian values.

In [None]:
plt.hist(wvs['Y011A'], bins=20);

In [None]:
wvs.Y011A.isna().sum()

The data for the survey has thus far been collected in seven waves, with each wave taking a few years to complete. The questions asked and the countries participating have changed and expanded over the course of each successive wave. Variable **S002VS** indicates the wave during which each interview was conducted:

In [None]:
wvs['S002VS'].value_counts(dropna=False).sort_index()

My initial plan was to create time series data by aggregating by year and country all the way back to the first wave, which started in 1981. One quick look at variable **S020**, which indicates the year in which each interview was conducted, shows why this won't be possible. There are big gaps in time between waves 1 and 2 (1985-1988), as well as between waves 2 and 3 (1992-1994).

Luckily, this issue dissipates thereafter, with each wave after wave 3 starting almost as soon as the previous wave completes. As a result, I should be able to create my time series data starting with the first year of wave 3 in 1995. Starting with wave 3 has some additional benefits, as it marked a significant expansion in the scope of the survey, both in terms of the questions asked and the countries included.

There is still a one-year gap in 2015. Once I have the data in the shape I want, I'll consider various methods of interpolation to fill the gap.

In [None]:
wvs['S020'].value_counts(sort=False).sort_index()
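As a rough illustration of what filling the 2015 gap could look like, here is a minimal sketch using linear interpolation. It assumes the data has already been reduced to one value per year; the series `yearly` and its values are invented for the example, and linear interpolation is only one of the methods I might end up using.

```python
import pandas as pd

# Hypothetical yearly means with 2015 missing (values are invented)
yearly = pd.Series({2013: 0.10, 2014: 0.20, 2016: 0.40, 2017: 0.30})

# Reindex onto a complete range of years so the gap shows up as NaN
full_years = range(int(yearly.index.min()), int(yearly.index.max()) + 1)
yearly_full = yearly.reindex(full_years)

# Linear interpolation fills 2015 with the midpoint of 2014 and 2016
yearly_filled = yearly_full.interpolate(method='linear')
```

Reindexing first matters: without an explicit NaN row for 2015 there is nothing for `interpolate` to fill.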

I'll start constructing the dataset for my time series model by removing interviews collected prior to 1995.

In [None]:
wvs_ts = wvs[wvs['S020'] >= 1995].copy()
wvs_ts.shape

Here is a list of the countries included in wave 3:

In [None]:
wvs_ts['COUNTRY_ALPHA'][wvs_ts['S002VS'] == 3].value_counts().sort_index()

## Sorting Countries Into Regions

Unfortunately, I'm discovering that when breaking things down to the country level there are more gaps year by year, with most countries completing their interviews for each wave in a single year, and some countries not involved in every wave. In order to create a dataset I can use for time series I'll need to group the countries into **regions**, so that I have data for every year. 

In [None]:
def waves_years(country_code):
    country_df = wvs_ts[['COUNTRY_ALPHA', 'S002VS', 'S020']][wvs_ts['COUNTRY_ALPHA'] == country_code]
    unique_values = country_df.drop_duplicates(subset = 'S020').sort_values(by='S020')
    return unique_values

In [None]:
waves_years('ROU')

In [None]:
waves_years('USA')

In [None]:
waves_years('BRA')

### Limitations of This Approach
##### ...Plus one Benefit

I'm a bit disappointed, as I was looking forward to seeing the relationships between countries in a given region. I also see further limitations here. Each annual data point for a given region will now come from a mostly different mix of countries than the previous annual data point. This could obscure trends and cause other issues. Overall, I am not that optimistic about how this change will affect the quality of the results I get from my time-series model.

One positive is that by going this route I can include data from countries that were added to the survey after wave 3.

### Defining **Regions**

I'll define 6 regions for the purposes of this variable:
1) Africa
2) Asia Pacific
3) Middle East/North Africa
4) Eastern Europe
5) Western Europe/North America
6) Latin America/Caribbean

I've chosen these regions to be *relatively* ideologically and geographically coherent.

In [None]:
region_dict = {
    'Africa': ['AGO', 'BEN', 'BWA', 'BFA', 'BDI', 'CMR', 'CPV', 'CAF', 'TCD',
               'COM', 'COG', 'CIV', 'DJI', 'GNQ', 'ERI', 'ETH', 'GAB', 'GMB',
               'GHA', 'GIN', 'GNB', 'KEN', 'LSO', 'LBR', 'MDG', 'MWI', 'MLI',
               'MRT', 'MUS', 'MYT', 'MOZ', 'NAM', 'NER', 'NGA', 'STP', 'REU',
               'RWA', 'ST', 'SEN', 'SYC', 'SLE', 'SOM', 'ZAF', 'SSD', 'SHN',
               'SDN', 'SWZ', 'TZA', 'TGO', 'UGA', 'COD', 'ZMB', 'TZA', 'ESH',
               'ZWE'],
    'Asia Pacific': ['AFG', 'AUS', 'BGD', 'BTN', 'BRN', 'KHM', 'CHN', 'CXR', 'CCK',
                     'IOT', 'FJI', 'PYF', 'GUM', 'HKG', 'IND', 'IDN', 'JPN', 'KAZ',
                     'PRK', 'KOR', 'KGZ', 'LAO', 'MAC', 'MYS', 'MDV', 'MHL', 'FSM',
                     'MNG', 'MMR', 'NPL', 'NZL', 'NFK', 'MNP', 'PAK', 'PLW', 'PNG',
                     'PHL', 'PCN', 'WSM', 'SGP', 'SLB', 'LKA', 'TWN', 'THA', 'TLS',
                     'TUV', 'VNM', 'UZB', 'TJK'],
    'Middle East/North Africa': ['DZA', 'BHR', 'EGY', 'IRN', 'IRQ', 'ISR', 'JOR', 'KWT', 'LBN',
                                 'LBY', 'MLT', 'MAR', 'OMN', 'PSE', 'QAT', 'SAU', 'SYR', 'TUN',
                                 'ARE', 'YEM', 'TUR', 'ARM', 'AZE'],
    'Eastern Europe': ['BLR', 'BGR', 'CZE', 'GEO', 'HUN', 'MDA', 'POL', 'ROU', 'RUS', 
                       'SVK', 'UKR', 'ALB', 'BIH', 'HRV', 'EST', 'LTU', 'LVA', 'MKD',
                       'MNE', 'SRB', 'SVN'],
    'Western Europe/North America': ['AUT', 'BEL', 'CAN', 'DNK', 'FIN', 'FRA', 'DEU', 'GRC', 'ISL',
                                     'ITA', 'LUX', 'NLD', 'NOR', 'PRT', 'ESP', 'SWE', 'CHE', 'GBR',
                                     'USA', 'IRL', 'CYP', 'MCO', 'AND', 'NIR'],
    'Latin America/Caribbean': ['ATG', 'ARG', 'BHS', 'BRB', 'BLZ', 'BOL', 'BRA', 'CHL', 'COL',
                                'CRI', 'CUB', 'DMA', 'DOM', 'ECU', 'SLV', 'GRD', 'GTM', 'HTI',
                                'HND', 'JAM', 'MEX', 'NIC', 'PAN', 'PRY', 'PER', 'KNA', 'LCA',
                                'VCT', 'SUR', 'TTO', 'URY', 'VEN', 'GUY', 'GUF', 'TCO', 'CYM',
                                'PRI']
}

In [None]:
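# Invert region_dict once so that each lookup is a single dictionary access.
# setdefault keeps the first region listed for a code, matching a first-match search.
country_to_region = {}
for region, countries in region_dict.items():
    for country_code in countries:
        country_to_region.setdefault(country_code, region)

def get_region(country_code):
    # Codes that belong to no region come back as NaN
    return country_to_region.get(country_code, np.nan)

# Series.map leaves unmatched country codes as NaN
wvs_ts['region'] = wvs_ts['COUNTRY_ALPHA'].map(country_to_region)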

In [None]:
wvs_ts['region'].value_counts(dropna=False)
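A hand-typed mapping this long is easy to get slightly wrong, so a quick sanity check seems worthwhile. The helper below, `find_mapping_problems`, is a hypothetical name of my own and not part of any library; it looks for codes listed more than once and codes that are not three uppercase letters. It is shown on a miniature invented mapping, `toy_regions`.

```python
from collections import Counter

def find_mapping_problems(regions):
    """Return (duplicated codes, codes that are not three uppercase letters)."""
    all_codes = [code for codes in regions.values() for code in codes]
    counts = Counter(all_codes)
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    malformed = sorted({
        code for code in all_codes
        if not (len(code) == 3 and code.isalpha() and code.isupper())
    })
    return duplicates, malformed

# Hypothetical miniature mapping with one duplicate and one malformed code
toy_regions = {
    'Africa': ['KEN', 'TZA', 'ST', 'TZA'],
    'Eastern Europe': ['POL', 'ROU'],
}
duplicates, malformed = find_mapping_problems(toy_regions)
```

Run against `region_dict`, a check like this would flag the two-letter entry `'ST'` and the repeated `'TZA'` in the Africa list. Neither affects the region assignment, but both are worth cleaning up.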

## Engineering an **Authoritarianism Index** to Serve as My Target Variable
I'll engineer a variable that I'll term the **Authoritarianism Index** to serve as the **y** variable for my models. I'll use this as the **y** for both my time series model and my traditional model.

This variable will be a composite score based on each respondent's answers to several questions.

### Questions to include

In order for a question to be considered for inclusion in my **Authoritarianism Index**, it needs to meet the following criteria:
- The question must be directly related to values associated with Authoritarianism
- The question must appear in all waves of the survey included in the model (waves 3-7)
- The response scale for the question must be ordinal in nature

Based on these criteria, I selected the following four questions to be included in the **Authoritarianism Index** variable:
| Question ID | Question Description | Response Scale | Directionality |
|---|---|---|---|
| A042 | "Here is a list of qualities that children can be encouraged to learn at home. Which, if any, do you consider to be especially important? Please choose up to five" (**Obedience**) | 0 (not mentioned) to 1 (important) | Positive |
| E018 | "I'm going to read out a list of various changes in our way of life that might take place in the near future. Please tell me for each one, if it were to happen, whether you think it would be a good thing, a bad thing, or don't you mind?" (**Greater respect for authority**) | 1 (good thing) to 3 (bad thing) | Negative |
| E114 | "I'm going to describe various types of political systems and ask what you think about each as a way of governing this country. For each one, would you say it is a very good, fairly good, fairly bad or very bad way of governing this country?" (**Having a strong leader**) | 1 (very good) to 4 (very bad) | Negative |
| E116 | "I'm going to describe various types of political systems and ask what you think about each as a way of governing this country. For each one, would you say it is a very good, fairly good, fairly bad or very bad way of governing this country?" (**Having the army rule**) | 1 (very good) to 4 (very bad) | Negative |


*Scales determined with reference to https://www.worldvaluessurvey.org/WVSOnline.jsp*

***Note:*** *The directionality of the scales is at times inconsistent. For example, for question **A042** the higher value marks the positive answer, while for the other three questions the lower value does. I've noted the directionality of each scale in the table above, and I'll need to account for it when engineering my composite variable.*

In [None]:
wvs_ts['A042'].value_counts()

In [None]:
wvs_ts['E018'].value_counts()

In [None]:
wvs_ts['E114'].value_counts()

In [None]:
wvs_ts['E116'].value_counts()

### Handling Missing Values

Notice that each variable above has a number of negative values associated with it. Each of these negative numbers corresponds to a different type of **missing data**, which are as follows:
- **-1**: Respondent answered "Don't know" to question
- **-2**: Respondent refused or otherwise provided "No answer" to question
- **-3**: Question "Not applicable". Subject screened out of question by virtue of a response to a filter question
- **-4**: Question was "Not asked in survey"
    - ***Note**: My expectation was that this related to differences in questions asked from one wave of the survey to another. As I specifically chose questions based on whether or not they appeared in all waves, I did not expect to see many of these. The fact that there are still a number of missing values of this type belies that assumption.*
- **-5**: "Missing: other"

I want to draw special attention to type **-1** here. In contrast to the other missing value types, these are not true *missings*, as the respondents in these cases were asked the question and provided an answer of a kind. It is reasonable to read a response of "I don't know" as a neutral response to many of these questions, particularly since respondents are often not offered a **neutral option** as one of the response choices. I'll handle these by creating that **neutral option** at the midpoint (e.g. a four-point scale becomes a five-point scale with the third option as **neutral**) and assigning items currently coded as **-1** to it. I'll do this as part of a number of adjustments I intend to make to the scales of these variables, in preparation for combining them into my **Authoritarianism Index** variable.

I considered doing the same for missing value type **-2**, but ultimately decided it could not be justified. Neutrality is one possible explanation for why a respondent might not answer a question. Another might be that the respondent feels ashamed of their opinion, or fears the judgement of the interviewer. There could be many other explanations. With no answer given, there is not enough information to determine the respondent's intent. Values of **-2** will therefore be recoded as NAs along with missing value types **-3**, **-4**, and **-5**.

For the purpose of constructing the **Authoritarianism Index** variable, observations will only be dropped if they have NAs for all four of the component questions. Otherwise, their **Authoritarianism Index** will be calculated using the available responses, and scaled to match the rest of the data.

*Meaning of missing values detailed in **WVS-7 Master Questionnaire 2017-2020 English.pdf**, which can be downloaded from https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp*

In [None]:
def missing_to_na(var):
    missings = [-2, -3, -4, -5]
    return var.replace(missings, np.nan)

In [None]:
aut_index_comp_qs = ['A042', 'E018', 'E114', 'E116']

for q in aut_index_comp_qs:
    wvs_ts[q] = missing_to_na(wvs_ts[q])

In [None]:
wvs_ts['E018'].value_counts(dropna=False)
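To make the effect of this step concrete, here is a small worked example on an invented series. It repeats the `missing_to_na` helper so that it runs on its own; the responses in `toy` are made up.

```python
import numpy as np
import pandas as pd

def missing_to_na(var):
    # Same logic as the helper above: -1 ("Don't know") is kept on purpose
    return var.replace([-2, -3, -4, -5], np.nan)

# Invented responses on a 1-4 scale, mixed with every missing-value code
toy = pd.Series([1, 4, -1, -2, -3, -4, -5, 2])
cleaned = missing_to_na(toy)
```

The four codes from -2 to -5 become NaN, while the -1 in the third position survives so that it can be recoded as the neutral midpoint later.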

### Standardizing Ordinal Scales For the Component Variables

I'll need to create a function next to standardize the ordinal scales for each of the component variables that will make up the **Authoritarianism Index**. The adjusted scales should satisfy the following requirements:
1) The scales should have the same endpoints, so that summing them gives each component variable equal weight in the derived variable
2) The scales should all be directionally the same, meaning that positive and negative responses are on the same end of the scale for each variable
3) The scales should have an odd number of potential values, allowing there to be a midpoint that corresponds to a neutral value (*See **Handling Missing Values** above*)

To satisfy these requirements, I'll create a function that adjusts the scale for each of the above questions to a five-point scale from -2 to 2, where -2 indicates a strongly negative response to the question and 2 indicates a strongly positive response. "Don't know" responses, currently coded as -1, will be recoded as 0.

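I'll reuse this function later to rescale other questions, some of which are on an eight- or ten-point scale, so I'll handle those as well. Those longer scales keep their resolution: they are mapped to -4 to 4 and -5 to 5 respectively, again with "Don't know" recoded as 0.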

In [None]:
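def scale_adjuster(var, direction='positive'):
    # The original scale is inferred from the largest observed value
    orig_scale = var.max()
    if direction == 'positive':
        if orig_scale == 1:
            return var.map({-1: 0, 0: -2, 1: 2})
        elif orig_scale == 2:
            # Series.map returns NaN for keys missing from the dict,
            # so 2 (strong positive) has to be listed explicitly
            return var.map({-1: 0, 1: -2, 2: 2})
        elif orig_scale == 3:
            return var.map({-1: 0, 1: -2, 2: 0, 3: 2})
        elif orig_scale == 4:
            return var.map({-1: 0, 1: -2, 2: -1, 3: 1, 4: 2})
        elif orig_scale == 5:
            # Five-point scales already have a midpoint at 3
            return var.map({-1: 0, 1: -2, 2: -1, 3: 0, 4: 1, 5: 2})
        elif orig_scale == 8:
            return var.map({-1: 0, 1: -4, 2: -3, 3: -2, 4: -1, 5: 1, 6: 2, 7: 3, 8: 4})
        else:
            # Anything else is treated as a ten-point scale
            return var.map({-1: 0, 1: -5, 2: -4, 3: -3, 4: -2, 5: -1, 6: 1, 7: 2, 8: 3, 9: 4, 10: 5})
    else:
        if orig_scale == 1:
            return var.map({-1: 0, 0: 2, 1: -2})
        elif orig_scale == 2:
            return var.map({-1: 0, 1: 2, 2: -2})
        elif orig_scale == 3:
            return var.map({-1: 0, 1: 2, 2: 0, 3: -2})
        elif orig_scale == 4:
            return var.map({-1: 0, 1: 2, 2: 1, 3: -1, 4: -2})
        elif orig_scale == 5:
            # Without this branch a five-point scale such as X045 would fall
            # through to the ten-point mapping below
            return var.map({-1: 0, 1: 2, 2: 1, 3: 0, 4: -1, 5: -2})
        elif orig_scale == 8:
            return var.map({-1: 0, 1: 4, 2: 3, 3: 2, 4: 1, 5: -1, 6: -2, 7: -3, 8: -4})
        else:
            # Anything else is treated as a ten-point scale
            return var.map({-1: 0, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1, 6: -1, 7: -2, 8: -3, 9: -4, 10: -5})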

In [None]:
negative_scaled_qs = ['E018', 'E114', 'E116']

for q in negative_scaled_qs:
    wvs_ts[q] = scale_adjuster(wvs_ts[q], direction='negative')
    
wvs_ts['A042'] = scale_adjuster(wvs_ts['A042'])

In [None]:
wvs_ts['E114'].value_counts()
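As a worked example of the rescaling, here is the mapping applied to a four-point question with negative directionality, such as **E114** (1 = very good, 4 = very bad). The responses in `toy` are invented; the dictionary is the one the function uses for this case.

```python
import pandas as pd

# Mapping for a four-point scale whose direction is "negative",
# with -1 ("Don't know") sent to the new neutral midpoint
four_point_negative = {-1: 0, 1: 2, 2: 1, 3: -1, 4: -2}

toy = pd.Series([1, 2, -1, 3, 4])  # invented responses
rescaled = toy.map(four_point_negative)

# A value that is not a key in the dictionary comes back as NaN
unlisted = pd.Series([5]).map(four_point_negative)
```

The second call shows why every valid response code has to appear in the dictionary: `Series.map` does not pass unlisted values through unchanged.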

### Creating the **Authoritarianism Index** Variable

Finally, I'll create the **Authoritarianism Index** variable itself. I'll do this by summing the responses to the individual questions, then dividing by the number of responses used in the calculation. In this way, values generated from respondents who are missing responses to one or more of the component questions will be on the same scale as values generated from all four component questions.

In [None]:
def make_composite(components):
    components_notna = []
    
    for component in components:
        if not np.isnan(component):
            components_notna.append(component)
    
    if len(components_notna) > 0:
        composite = sum(components_notna) / len(components_notna)
    else:
        composite = np.nan
    
    return composite

In [None]:
wvs_ts['authoritarianism_index'] = [make_composite(row) for row in wvs_ts[aut_index_comp_qs].values]
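Since the composite is simply the mean of whichever components are present, a vectorized alternative is `DataFrame.mean(axis=1)`, which skips NaN by default and returns NaN for a row with no valid values. The sketch below checks this on an invented frame, `toy`, that reuses the component column names; I'm showing it as an equivalent shortcut, not as a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Invented rescaled responses; NaN marks a missing component
toy = pd.DataFrame({
    'A042': [2, -2, np.nan, np.nan],
    'E018': [2, 0, np.nan, np.nan],
    'E114': [1, np.nan, -1, np.nan],
    'E116': [-1, np.nan, -2, np.nan],
})

# Row-wise mean over the available answers; skipna=True is the default
index_vectorized = toy.mean(axis=1)
```

The first row averages four answers, the second and third average two each, and the last row, with nothing to average, stays NaN.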

We can see that values for the **Authoritarianism Index** are on the same -2 to 2 scale as the individual components:

In [None]:
wvs_ts[['authoritarianism_index']].value_counts(sort=False).sort_index()

And by engineering the variable in the way that I did, I've cut way down on null values without having to eliminate observations:

In [None]:
wvs_ts['authoritarianism_index'].isna().sum()

I will have to drop these few observations from my dataset prior to modelling, but I'm quite satisfied with how little data will be lost at this stage.

In [None]:
wvs_ts = wvs_ts[wvs_ts['authoritarianism_index'].notna()].copy()

List of steps in any order:
- **X** figure out what to do with null values 
- **X** create y (authoritarianism index) from the questions above 
- **X** make list of secondary questions to use in X 
- think about how to incorporate a 'freedom index', perhaps as a target value, as an alternative to time series
- **X** figure out how to divide world into regions
    - then do it
- group by year
- use interpolation to fill in missing year 2015
- do some of the time series EDA stuff
- make lots of visualizations
- construct a time series model, probably RNN, but should also do a simpler model
- write out problem statement
- write out methodology

## Other Potentially Correlated Questions

The questions on the list below do not address **Authoritarianism** directly, but rather concern other values that may be correlated with authoritarian thinking. These will be the X variables for my regression model, and some may be included as secondary variables in my time series model as well. I'll be very interested in which ones correlate most highly with my **Authoritarianism Index**.

| Question ID | Question Description | Response Scale | Directionality |
|---|---|---|---|
| A004, A005, A006 | "For each of the following aspects, indicate how important it is in your life" (**Politics, Work, Religion**) | 1 (very important) to 4 (Not at all important) | Negative |
| A008 | "Taking all things together, would you say you are:" (**Happiness**) | 1 (very happy) to 4 (not at all happy) | Negative |
| A029, A030, A032, A034, A035, A039, A040, A041 | "Here is a list of qualities that children can be encouraged to learn at home. Which, if any, do you consider to be especially important? Please choose up to five" (**Independence, Hard work, Feeling of responsibility, imagination, tolerance and respect for other people, determination/perseverance, religious faith, unselfishness**) | 0 (not mentioned) to 1 (important) | Positive |
| A124_02, 03, 06, 07, 08, 09, 12, 17 | "On this list are various groups of people. Could you please mention any that you would not like to have as neighbors?" (**People of a different race, Heavy drinkers, Immigrants/foreign workers, People who have AIDS, Drug addicts, Homosexuals, People of a different religion, Gypsies**) | 0 (not mentioned) to 1 (mentioned) | Positive |
| A165 | "Generally speaking, would you say that most people can be trusted or that you need to be very careful in dealing with people?" | 1 (Most people can be trusted) to 2 (Need to be very careful) | Negative |
| A170 | "All things considered, how satisfied are you with your life as a whole these days?" | 1 (Dissatisfied) to 10 (Satisfied) | Positive |
| D054 | "One of my main goals in life has been to make my parents proud" | 1 (Strongly agree) to 4 (Strongly disagree) | Negative |
| D059 | "Men make better political leaders than women do" | 1 (Strongly agree) to 4 (Strongly disagree) | Negative |
| D060 | "University is more important for a boy than for a girl" | 1 (Strongly agree) to 4 (Strongly disagree) | Negative |
| E003 | "If you had to choose, which one of the things on this card would you say is most important?" | **Categorical**: 1 = 'Maintaining order in the nation', 2 = 'Giving people more say in important government decisions', 3 = 'Fighting rising prices', 4 = 'Protecting freedom of speech' | Not Applicable |
| E012 | "Of course, we all hope that there will not be another war, but if it were to come to that, would you be willing to fight for your country?" | 0 (No) to 1 (Yes) | Positive |
| E015, E016 | "I'm going to read out a list of various changes in our way of life that might take place in the near future. Please tell me for each one, if it were to happen, whether you think it would be a good thing, a bad thing, or don't you mind?" (**Less importance placed on work, More emphasis on technology**) | 1 (good thing) to 3 (bad thing) | Negative |
| E023 | "How interested would you say you are in politics?" | 1 (Very interested) to 4 (Not at all interested) | Negative |
| E069_01, 02, 04, 05, 06, 10, 11, 12, 13, 14, 15 | "I am going to name a number of organizations. For each one, could you tell me how much confidence you have in them" (**Churches, Armed Forces, The Press, Labour Unions, Police, Television, The Government, Political Parties, Major Companies, Environmental Protection Movement, Women's Movement**) | 1 (A great deal) to 4 (None at all) | Negative |
| F028 | "Apart from weddings, funerals and christenings, about how often do you attend religious services these days?" | 1 (More than once a week) to 8 (Never, practically never) | Negative |
| F063 | "How important is God in your life?" | 1 (Not at all important) to 10 (Very important) | Positive |
| F116, F117, F118, F119, F120, F121, F122, F123 | "Please tell me for each of the following actions whether you think it can always be justified, never be justified, or something in between" (**Cheating on taxes, Someone accepting a bribe, Homosexuality, Prostitution, Abortion, Divorce, Euthanasia, Suicide**) | 1 (Never justifiable) to 10 (Always justifiable) | Positive |

***Note**: Response scales are **ordinal** unless otherwise specified*

*Scales determined with reference to https://www.worldvaluessurvey.org/WVSOnline.jsp*
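One simple way to see which of these questions track the **Authoritarianism Index** most closely is to correlate each candidate column with it and sort by absolute strength. The sketch below uses `DataFrame.corrwith` (Pearson by default) on an invented frame; the column names are borrowed from the tables above, but the values are made up, and Pearson correlation on ordinal data is only a first-pass screen.

```python
import pandas as pd

# Invented values; only the column names come from the notebook
toy = pd.DataFrame({
    'authoritarianism_index': [-2.0, -1.0, 0.0, 1.0, 2.0],
    'F063': [-5, -3, 1, 2, 5],   # rises with the index
    'F118': [5, 3, 1, -2, -4],   # falls as the index rises
})

predictors = toy.drop(columns='authoritarianism_index')

# Pearson correlation of each predictor column with the target
corrs = predictors.corrwith(toy['authoritarianism_index'])

# Order by absolute strength, strongest first, keeping the sign
ranked = corrs.reindex(corrs.abs().sort_values(ascending=False).index)
```

Keeping the sign in `ranked` matters here, since a strong negative correlation is just as informative as a strong positive one.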

## Demographic Variables

I'll also incorporate some demographic variables into the regression model. Observations in the time series model will be aggregates by year and region of the world, so it won't make sense to use individual demographic data in that model.

| Variable ID | Variable Description | Response Scale | Directionality | Model |
|---|---|---|---|---|
| X001 | sex | **Categorical**: 1 = male, 2 = female | NA | Regression |
| X003 | Age (at time of interview) | **Numeric** | NA | Regression |
| X007 | Marital status | **Categorical**: 1 = 'Married', 2 = 'Living together', 3 = 'Divorced', 4 = 'Separated', 5 = 'Widowed', 6 = 'Single' | NA | Regression |
| X011 | Number of children | **Numeric** | NA | Regression |
| X025R | Education level (Recoded into three groups) | 1 (Lower) to 3 (Higher) | Positive | Regression |
| X028 | Employment status | **Categorical**: 1 = 'Full time (> 30hr/wk)', 2 = 'Part time (< 30hr/wk)', 3 = 'Self employed', 4 = 'Retired/pensioned', 5 = 'Housewife not otherwise employed', 6 = 'Student', 7 = 'Unemployed', 8 = 'Other' | NA | Regression |
| X045 | Social class (subjective/self described) | 1 (Upper class) to 5 (Lower class) | Negative | Regression |
| X047R_WVS | Income level (subjective, recoded into three groups) | 1 (low) to 3 (high) | Positive | Regression |
| COUNTRY_ALPHA | Country of Respondent | Categorical: Three-Letter Country Code | NA | Time-Series |
| S002VS | Survey Wave | Numeric | NA | Time-Series |
| S020 | Year of Interview | Numeric/Datetime | NA | Time-Series |

***Note**: Response scales are **ordinal** unless otherwise specified*

*Scales determined with reference to https://www.worldvaluessurvey.org/WVSOnline.jsp*

### Transforming X Variables
I'll do the same two transformations on these variables as I did for the component variables of the **Authoritarianism Index**:
1) Handling Missing Values
2) Standardizing Ordinal Scales

In [None]:
positive_scaled_qs = ['A029', 'A030', 'A032', 'A034', 'A035', 'A039', 'A040', 'A041',
                      'A124_02', 'A124_03', 'A124_06', 'A124_07', 'A124_08', 'A124_09',
                      'A124_12', 'A124_17', 'A170', 'E012', 'F063', 'F116', 'F117',
                      'F118', 'F119', 'F120', 'F121', 'F122', 'F123', 'X025R', 'X047R_WVS']

negative_scaled_qs = ['A004', 'A005', 'A006', 'A008', 'A165', 'D054', 'D059', 'D060',
                      'E015', 'E016', 'E023', 'E069_01', 'E069_02', 'E069_04', 'E069_05',
                      'E069_06', 'E069_10', 'E069_11', 'E069_12', 'E069_13', 'E069_14',
                      'E069_15', 'F028', 'X045']

all_ordinal_qs = positive_scaled_qs + negative_scaled_qs

non_ordinal_qs = ['E003', 'X001', 'X007', 'X028', 'COUNTRY_ALPHA', 'S002VS', 'S020']

all_qs = all_ordinal_qs + non_ordinal_qs

In [None]:
for q in all_ordinal_qs:
    wvs_ts[q] = missing_to_na(wvs_ts[q])

In [None]:
for q in negative_scaled_qs:
    wvs_ts[q] = scale_adjuster(wvs_ts[q], direction='negative')
    
for q in positive_scaled_qs:
    wvs_ts[q] = scale_adjuster(wvs_ts[q])

In [None]:
wvs_ts['X047R_WVS'].value_counts(dropna=False)

## Modelling Dataset

I'll now create my base modelling dataset by combining my two engineered variables ('**region**' and '**authoritarianism_index**') with the other variables I've selected.

In [None]:
engineered_vars = ['region', 'authoritarianism_index']
all_vars = engineered_vars + all_qs

wvs_model = wvs_ts[all_vars]

In [None]:
wvs_model.shape

In [None]:
wvs_model.to_csv('./data/wvs_model.csv', index=False)

This is the point where the shape of the data for my regression and time-series models will begin to diverge. I'll continue the process of shaping and preparing the data for modelling in the next two notebooks on EDA and data visualization.
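For the time-series side, the divergence amounts to collapsing respondent-level rows into one value per region and year. The sketch below shows one possible shape for that step on invented data; the frame `toy` and the output names `mean_index` and `n_respondents` are my own placeholders, and the actual aggregation is left to the later notebooks.

```python
import pandas as pd

# Invented respondent-level rows using the notebook's column names
toy = pd.DataFrame({
    'region': ['Africa', 'Africa', 'Africa', 'Eastern Europe', 'Eastern Europe'],
    'S020': [1995, 1995, 1996, 1995, 1995],
    'authoritarianism_index': [1.0, 0.0, -0.5, -1.0, -2.0],
})

# One row per region and year: mean index plus the number of respondents behind it
region_year = (
    toy.groupby(['region', 'S020'])['authoritarianism_index']
       .agg(mean_index='mean', n_respondents='count')
       .reset_index()
)

# Wide layout: years as rows, regions as columns, NaN where a region has no data
wide = region_year.pivot(index='S020', columns='region', values='mean_index')
```

Keeping the respondent count alongside the mean makes it easy to spot region-years that rest on very few interviews.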