## Chapter 7. Exploratory data analysis
#### Notebook for Python. Additional notebook to clean the data and create the file eurobarometer.csv

Van Atteveldt, W., Trilling, D. & Arcila, C. (2022). <a href="https://cssbook.net" target="_blank">Computational Analysis of Communication</a>. Wiley.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ccs-amsterdam/ccsbook/blob/master/chapter07/cleaning_eurobarometer_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

In [1]:
import pandas as pd
import numpy as np

In [2]:
#url='../datasets/ZA6928_v1-0-0.csv'
url = "https://media.githubusercontent.com/media/ccs-amsterdam/ccsbook/master/docs/d/ZA6928_v1-0-0.csv"
d = pd.read_csv(url, header=0, sep=';', low_memory=False)
print("Shape of my original data=", d.shape)

#Select and rename columns; .copy() makes d2 an independent DataFrame,
#so the column assignments in the next cell do not raise SettingWithCopyWarning
d2 = d[['survey', 'uniqid', 'p1', 'tnscntry', 'd7', 'd8', 'd10', 'd11', 'd15a', 'd25', 'd40a', 'qd9_4', 'qd9_1']].copy()
d2.columns = ['survey', 'uniqid', 'date', 'country', 'marital_status', 'educational', 'gender', 'age', 'occupation', 'type_community', 'household_composition', 'support_refugees', 'support_migrants']
print('Shape of my filtered data =', d2.shape)

print("Variables:", d2.columns)

Shape of my original data= (33193, 705)
Shape of my filtered data = (33193, 13)
Variables: Index(['survey', 'uniqid', 'date', 'country', 'marital_status', 'educational',
       'gender', 'age', 'occupation', 'type_community',
       'household_composition', 'support_refugees', 'support_migrants'],
      dtype='object')
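
Selecting columns with `d[[...]]` and then assigning to columns of the result is the pattern that typically triggers pandas' `SettingWithCopyWarning`. The following is a minimal sketch on an invented toy frame (the names `toy` and `subset` and all values are illustrative assumptions, not part of the Eurobarometer data) of how an explicit `.copy()` yields an independent DataFrame that can be modified safely:

```python
import pandas as pd

# Toy stand-in for the survey data; all values are invented for illustration.
toy = pd.DataFrame({
    "uniqid": [1, 2, 3],
    "d11": ["15 years", "42", "99 years (and older)"],
    "other": ["a", "b", "c"],
})

# .copy() makes `subset` an independent DataFrame, so later assignments
# modify it directly and leave `toy` untouched.
subset = toy[["uniqid", "d11"]].copy()
subset.columns = ["uniqid", "age"]

# Strip the labels from the boundary categories, then convert to numbers.
subset["age"] = pd.to_numeric(
    subset["age"].replace({"15 years": "15", "99 years (and older)": "99"})
)

print(subset["age"].tolist())
print(toy["d11"].tolist())  # the original frame still holds the raw labels
```

The copy costs some memory, but for a 13-column subset that is usually a small price for unambiguous assignment semantics.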


In [3]:
#Replace some categories by missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
d2['support_refugees'] = d2['support_refugees'].replace('Inap. (not 1 in eu28)', np.nan)
d2['support_refugees'] = d2['support_refugees'].replace('DK', np.nan)

#Replace labelled age values with plain digit strings and convert to numeric
d2['age'] = d2['age'].replace('15 years', '15')
d2['age'] = d2['age'].replace('98 years', '98')
d2['age'] = d2['age'].replace('99 years (and older)', '99')
d2['age'] = pd.to_numeric(d2['age'])

#We transform date, support_refugees and support_migrants into new numerical variables
#Days in order: 5 November 2017 = 1, ..., 19 November 2017 = 15
d2['date_n'] = d2['date'].replace({
    "Sunday, 5th November 2017": '1',
    "Monday, 6th November 2017": '2',
    "Tuesday, 7th November 2017": '3',
    "Wednesday, 8th November 2017": '4',
    "Thursday, 9th November 2017": '5',
    "Friday, 10th November 2017": '6',
    "Saturday, 11th November 2017": '7',
    "Sunday, 12th November 2017": '8',
    "Monday, 13th November 2017": '9',
    "Tuesday, 14th November 2017": '10',
    "Wednesday, 15th November 2017": '11',
    "Thursday, 16th November 2017": '12',
    "Friday, 17th November 2017": '13',
    "Saturday, 18th November 2017": '14',
    "Sunday, 19th November 2017": '15',
})
d2['date_n'] = pd.to_numeric(d2['date_n'])

#Level of support for refugees from 1 to 4
d2['support_refugees_n'] = d2['support_refugees'].replace({
    "Totally disagree": "1",
    "Tend to disagree": "2",
    "Tend to agree": "3",
    "Totally agree": "4",
})
d2['support_refugees_n'] = pd.to_numeric(d2['support_refugees_n'])

#Level of support for migrants from 1 to 4 (and replace missing values by NaN)
d2['support_migrants'] = d2['support_migrants'].replace('Inap. (not 1 in eu28)', np.nan)
d2['support_migrants'] = d2['support_migrants'].replace('DK', np.nan)
d2['support_migrants_n'] = d2['support_migrants'].replace({
    "Totally disagree": "1",
    "Tend to disagree": "2",
    "Tend to agree": "3",
    "Totally agree": "4",
})
d2['support_migrants_n'] = pd.to_numeric(d2['support_migrants_n'])

#Transform educational into a continuous (numeric) variable; non-numeric answers become NaN
d2['educational_n'] = d2['educational']
d2['educational_n'] = d2['educational_n'].replace('DK', np.nan)
d2['educational_n'] = d2['educational_n'].replace('Still studying', np.nan)
d2['educational_n'] = d2['educational_n'].replace('No full-time education', np.nan)
d2['educational_n'] = d2['educational_n'].replace('Refusal', np.nan)
d2['educational_n'] = d2['educational_n'].replace('2 years', '2')
d2['educational_n'] = d2['educational_n'].replace('75 years', '75')
d2['educational_n'] = pd.to_numeric(d2['educational_n'])

#Recode country names to standard names of naturalearthdata.com (library geopandas)
#https://ramiro.org/notebook/metal-bands-map/
d2['country'] = d2['country'].replace({
    "BALGARIJA": "Bulgaria",
    "BELGIQUE": "Belgium",
    "CESKA REPUBLIKA": "Czech republic",
    "DANMARK": "Denmark",
    "DEUTSCHLAND OST": "Germany",
    "DEUTSCHLAND WEST": "Germany",
    "EESTI": "Estonia",
    "ELLADA": "Greece",
    "ESPANA": "Spain",
    "FRANCE": "France",
    "GREAT BRITAIN": "United Kingdom",
    "HRVATSKA": "Croatia",
    "IRELAND": "Ireland",
    "ITALIA": "Italy",
    "KYPROS": "Cyprus",
    "LATVIA": "Latvia",
    "LIETUVA": "Lithuania",
    "LUXEMBOURG": "Luxemburg",
    "MAGYARORSZAG": "Hungary",
    "MALTA": "Malta",
    "NEDERLAND": "Netherlands",
    "NORTHERN IRELAND": "United Kingdom",
    "POLSKA": "Poland",
    "PORTUGAL": "Portugal",
    "ROMANIA": "Romania",
    "SLOVENIJA": "Slovenia",
    "SLOVENSKA REPUBLIC": "Slovakia",
    "SUOMI": "Finland",
    "SVERIGE": "Sweden",
    "ÖSTERREICH": "Austria",
})
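
The recoding of the agreement labels into scores from 1 to 4 can also be expressed as a single lookup. The following is a hedged sketch on invented answers (the names `LIKERT_SCORES`, `answers` and `scores` are illustrative assumptions) that uses `Series.map` with a dictionary; it is an alternative to the chained `replace` calls, not the method used in this notebook:

```python
import pandas as pd

# Label-to-score mapping, using the same labels and scores as the recoding above.
LIKERT_SCORES = {
    "Totally disagree": 1,
    "Tend to disagree": 2,
    "Tend to agree": 3,
    "Totally agree": 4,
}

# Invented example answers, including the two non-substantive categories.
answers = pd.Series(
    ["Totally agree", "DK", "Tend to disagree", "Inap. (not 1 in eu28)"]
)

# Series.map returns NaN for every value that is not a key of the mapping,
# so "DK" and "Inap. ..." become missing without a separate replace step.
scores = answers.map(LIKERT_SCORES)

print(scores.tolist())
```

Note the difference in behaviour: `replace` leaves unlisted values untouched, whereas `map` turns them into `NaN`. That is convenient here, but it would silently drop any label that is misspelled in the dictionary, so checking `scores.isna().sum()` against the expected number of missing answers is a sensible safeguard.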


In [122]:
#Save the cleaned data to a CSV file (uncomment the next line to write it)
#d2.to_csv(r'eurobarom_nov_2017.csv', index=False)
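
Before relying on an exported file, it can help to confirm that it reads back as expected. The following is a minimal sketch, assuming an invented three-row frame and a hypothetical file name `example.csv`, that writes with `index=False` and reloads the result:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented miniature of the cleaned data; values are for illustration only.
cleaned = pd.DataFrame({
    "country": ["Spain", "Italy", "Malta"],
    "age": [34, 51, 27],
    "support_refugees_n": [3.0, np.nan, 1.0],
})

# Write to a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name
    cleaned.to_csv(path, index=False)  # index=False: no extra index column
    restored = pd.read_csv(path)

# Missing values are written as empty fields and read back as NaN.
print(restored.dtypes)
print(restored["support_refugees_n"].isna().sum())
```

Without `index=False`, the row index would be written as an unnamed first column and reappear as `Unnamed: 0` when the file is read again.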