Purpose of the script: 
Basic Sentiment Analysis of answers to questions 1, 4, 5, and 8 in the Census Consultations.

### 1. Imports and Set Up

In [350]:
import os
import pandas as pd
import numpy as np

In [None]:
# Set the working directory (os.chdir returns None, so there is nothing to assign)

os.chdir('/Users/alessia/Documents/DataScience/NLP_Project/Data')

### 2. Get Data

In [477]:
# Read in data (note header is spread over two rows)

cons0_df = pd.read_excel("The CensusCopy.xlsx",  header=None)

### 3. Transform Data

3.1. Combine the header, currently spread over two rows, into a single row

In [478]:
# Explore data

cons0_df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,Respondent ID,Collector ID,Start Date,End Date,IP Address,Email Address,First Name,Last Name,Custom Data 1,Are you responding on behalf of an organisatio...,...,2. Please specify any significant uses of popu...,3. Please specify any significant additional b...,4. What would the impact be if the most detail...,,5. What would the additional benefit be if mor...,,6. Please specify any significant uses of cens...,7. What advantages or opportunities for geneal...,8. What are your views of the risks of each ce...,9. Are there any other issues that you believe...
1,,,,,,,,,,Response,...,Open-Ended Response,Open-Ended Response,Response,<b>4. 1. If you have answered high or medium i...,Response,<b>5. 1. If you have answered high or medium i...,Open-Ended Response,Open-Ended Response,Open-Ended Response,Open-Ended Response
2,3001215611,45151668,2014-01-05 02:42:21,2014-01-05 02:44:13,49.224.154.245,,,,,,...,,,,,,,,,,


In [479]:
print(cons0_df.shape)  # (1110, 50)

(1110, 50)


In [480]:
# Row 1:

# Forward-fill along the columns (axis=1), so that a NaN cell takes the value of the cell to its left

row1 = cons0_df.ffill(axis=1).values[:1, :]

In [481]:
# Checks
print(row1.ndim)
print(row1.shape)          # (1,50)
print(row1[:, [0, -1]])    # print first and last values

2
(1, 50)
[['Respondent ID'
  '9. Are there any other issues that you believe we should be taking into account?']]


In [482]:
# Row 2:

# Replace NaN with an empty string (NaN is a float, and we want the row to contain only strings)

row2 = cons0_df.fillna('').values[1:2, :]

In [483]:
# Checks
print(type(row2))
print(row2.ndim)
print(row2.shape)  # (1,50)
print(row2[:, [0, -1]])

<class 'numpy.ndarray'>
2
(1, 50)
[['' 'Open-Ended Response']]


In [484]:
# Combine row1 and row2 element-wise into a single "header" row (string concatenation, no separator)

header_row = row1 + row2
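
The header merge above can be tried out on a small toy frame. This is a minimal sketch, assuming a made-up three-column layout; `toy_df` and its values are hypothetical and only mimic the two-row header of the spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two-row header layout (hypothetical values)
toy_df = pd.DataFrame(
    [
        ["ID", "Q1. Impact?", np.nan],
        [np.nan, "Response", "Further information"],
        [1, "High", "Some text"],
    ],
    dtype=object,
)

# Row 1: carry each question label rightwards over its blank continuation cells
top = toy_df.iloc[[0]].ffill(axis=1).to_numpy()

# Row 2: blank out missing sub-headers so both rows hold only strings
bottom = toy_df.iloc[[1]].fillna("").to_numpy()

# Element-wise string concatenation of the two object arrays
combined = (top + bottom)[0].tolist()
print(combined)
```

Because the two rows are concatenated without a separator, the question text and the sub-header run together. Joining with a space would be easier to read, at the cost of changing the column names used later.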

3.2. Reconstruct the dataframe

In [485]:
# Save header_row as DataFrame
header_row_df = pd.DataFrame(header_row)

# Save all other rows as dataframe
data_values_df = pd.DataFrame(cons0_df.values[2:, :])


In [486]:
# Stack the two together (DataFrame.append was removed in pandas 2.0, so use pd.concat)
cons1_df = pd.concat([header_row_df, data_values_df],
                     ignore_index=True
                    )

In [487]:
# Use the first row as the header
cons1_df.columns = cons1_df.iloc[0]

# Drop the first row (which is now redundant)
cons1_df = cons1_df.drop(0)

In [488]:
# Reset index 
cons1_df = cons1_df.reset_index(drop=True)    

In [490]:
# Checks
print(cons1_df.columns.values[:8])
print(cons1_df.columns.values[-1:])

['Respondent ID' 'Collector ID' 'Start Date' 'End Date' 'IP Address'
 'Email Address' 'First Name' 'Last Name']
[ '9. Are there any other issues that you believe we should be taking into account?Open-Ended Response']


### 4. Sentiment Analysis of questions 1, 4, 5 and 8

4.1. Define function to calculate polarity score for the answers in our dataset

In [395]:
# Define function to calculate polarity score for the answers in our dataset

def get_sentiment_score(data, col_ind):
    """Return a list of VADER compound polarity scores for the values in the column at position col_ind."""

    # Import here so the function is self-contained
    # (the VADER lexicon must be available, e.g. via nltk.download('vader_lexicon'))
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    analyser = SentimentIntensityAnalyzer()

    # Collector for the scores, one per row
    sentiment_bag = []

    for answer in data.iloc[:, col_ind]:

        # No answer was provided: record NaN
        if pd.isnull(answer):
            sentiment_bag.append(np.nan)

        else:
            # 'compound' is the normalised overall score, between -1 and +1
            sentiment_bag.append(analyser.polarity_scores(answer)['compound'])

    return sentiment_bag
    

4.2. Calculate Sentiment Score for answers to relevant questions: Q1, Q4, Q5, Q8

In [491]:
# Get column index of questions

idx_Q1 = cons1_df.columns.get_loc(str([col for col in cons1_df if 'census methods' in str(col)][0]))
idx_Q4 = cons1_df.columns.get_loc(str([col for col in cons1_df if '4. 1. ' in str(col)][0]))
idx_Q5 = cons1_df.columns.get_loc(str([col for col in cons1_df if '5. 1.' in str(col)][0]))
idx_Q8 = cons1_df.columns.get_loc(str([col for col in cons1_df if '8.' in str(col)][0]))


In [492]:
# Checks
idx_Q1, idx_Q4, idx_Q5, idx_Q8

(39, 43, 45, 48)
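
The four lookups above all follow the same pattern: find the first column whose label contains a given fragment and return its position. A minimal sketch of that pattern as a reusable helper is shown here; `find_column_index` and the toy column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_column_index(df, fragment):
    """Return the position of the first column whose label contains `fragment`."""
    for position, label in enumerate(df.columns):
        if fragment in str(label):
            return position
    # Fail loudly instead of raising an opaque IndexError on an empty list
    raise KeyError(f"No column label contains {fragment!r}")

# Toy frame with made-up column labels
toy_df = pd.DataFrame(columns=["Respondent ID",
                               "1. Views on census methods?",
                               "8. Views on risks?"])

idx_first = find_column_index(toy_df, "census methods")
idx_last = find_column_index(toy_df, "8.")
print(idx_first, idx_last)
```

Returning the position directly also sidesteps duplicate labels, for which `Index.get_loc` returns a slice or boolean mask rather than a single integer.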

In [493]:
# Calculate and save the Sentiment Score as new columns in the dataset

cons1_df['Q1_Sentiment'] = get_sentiment_score(cons1_df, idx_Q1)
cons1_df['Q4_Sentiment'] = get_sentiment_score(cons1_df, idx_Q4)
cons1_df['Q5_Sentiment'] = get_sentiment_score(cons1_df, idx_Q5)
cons1_df['Q8_Sentiment'] = get_sentiment_score(cons1_df, idx_Q8)

In [494]:
# Take a look at the result
cons1_df.iloc[:, [idx_Q1, -4, idx_Q4, -3, idx_Q5, -2, idx_Q8, -1]]

Unnamed: 0,1. What are your views of the different census methods described in the consultation document?Open-Ended Response,Q1_Sentiment,"4. What would the impact be if the most detailed statistics for very small geographic areas and small population groups were no longer available? High, medium, low or no impact? <b>4. 1. If you have answered high or medium impact, please give further information.</b>",Q4_Sentiment,"5. What would the additional benefit be if more frequent (i.e. annual) statistics about population characteristics were available for areas like local authorities and electoral wards? High, medium, low or no additional benefit?<b>5. 1. If you have answered high or medium impact, please give further information.</b>",Q5_Sentiment,8. What are your views of the risks of each census approach and how they might be managed?Open-Ended Response,Q8_Sentiment
0,,,,,,,,
1,,,,,,,,
2,,,,,,,,
3,Moving to a primarily online census: an inevit...,-0.4585,It is important for the Census to provide data...,0.6486,Up to date statistics at postcode sector (or e...,0.4404,It is essential that any changes to census met...,0.7596
4,A regular full population census is absolutely...,0.9814,Would lose the ability to understand the local...,0.8360,It would allow the council to respond more eff...,0.9651,Measures must be put in place to ensure that n...,0.2500
5,Privacy is a clear concern with the whole coun...,0.9648,,,,,There are some users of the census who place a...,0.9619
6,,,,,,,,
7,"Neither is satisfactory, inevitably so given t...",0.1887,,,Local authorities suffer from out of date cens...,-0.5574,The first approach simply perpetuates existing...,0.4939
8,Continuance of census once a decade with on-li...,-0.0258,Historic family data will no longer be availab...,-0.5267,Frequent statistics updates are needed by loca...,-0.9020,,
9,•\tWhile a 10-year census has its uses for his...,0.9879,,,10-yearly information is now out-moded - the w...,0.8004,The 10-year census method is out-dated and the...,0.5565


In [495]:
# Summary statistics
cons1_df.iloc[:, [idx_Q1, -4, idx_Q4, -3, idx_Q5, -2, idx_Q8, -1]].describe()

Unnamed: 0,Q1_Sentiment,Q4_Sentiment,Q5_Sentiment,Q8_Sentiment
count,736.0,523.0,396.0,490.0
mean,0.388333,0.07341,0.33806,0.092375
std,0.523643,0.515057,0.425649,0.575458
min,-0.9817,-0.983,-0.9042,-0.9691
25%,0.0,-0.3182,0.0,-0.3612
50%,0.4939,0.0,0.4404,0.0
75%,0.866425,0.4404,0.690275,0.6339
max,0.9998,0.9999,0.9954,0.9988
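
The compound scores summarised above range from -1 to +1, so one way to read them is to bin them into negative, neutral and positive labels. The sketch below assumes the 0.05 cut-off commonly suggested for VADER compound scores; the helper `label_compound` and the threshold are illustrative assumptions, not a step the original analysis performs.

```python
import numpy as np
import pandas as pd

def label_compound(score, threshold=0.05):
    """Map a compound score to a coarse label; the threshold is an assumed convention."""
    if pd.isnull(score):
        return np.nan          # unanswered questions stay missing
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# A few scores taken from the table above, plus a missing answer
scores = pd.Series([-0.4585, 0.0, 0.9814, np.nan])
labels = scores.map(label_compound)
print(labels.value_counts(dropna=False))
```

On the real data the same mapping could be applied to each of the `Q*_Sentiment` columns, which would show, for example, how many of the exact-zero medians for Q4 and Q8 reflect neutral answers.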
