:::{.column-page}
![](human-hugging-computer.webp)
:::

Pew Research Center publishes survey datasets on people's opinions across a wide range of topics. One of them comes from a survey about people's opinions on artificial intelligence. Going by the interview timestamps in the data, it was fielded in early November 2021.

The question I want to explore here: how accurately can we predict someone's opinion on AI from their demographic data?

::: {.callout-tip}
[Open this notebook in Colab](https://colab.research.google.com/github/geirfreysson/ai-experiments/blob/main/posts/11-feb-2024-fpl/index.ipynb) to try for yourself.
:::


## Import and explore
First, we import pandas, load the survey's SPSS file, and take a first look at the data.

In [6]:
!wget -q https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

In [1]:
import pandas as pd
import tensorflow as tf
#from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# read_spss requires the pyreadstat package to be installed
all_data = pd.read_spss('pew_research_ai.sav')
all_data.head()



Unnamed: 0,QKEY,INTERVIEW_START_W99,INTERVIEW_END_W99,DEVICE_TYPE_W99,LANG_W99,FORM_W99,XTABLET_W99,TECH1_W99,SC1_W99,CNCEXC_W99,...,F_PARTYLN_FINAL,F_PARTYSUM_FINAL,F_PARTYSUMIDEO_FINAL,F_INC_SDT1,F_REG,F_IDEO,F_INTFREQ,F_VOLSUM,F_INC_TIER2,WEIGHT_W99
0,100260.0,2021-11-03 14:25:27,2021-11-03 14:45:34,Laptop/PC,English,Form 2,Non-tablet HH,,Mostly positive,Equally concerned and excited,...,,Rep/Lean Rep,Conservative Rep/Lean,"$50,000 to less than $60,000",You are ABSOLUTELY CERTAIN that you are regist...,Very conservative,Several times a day,No,Middle income,0.206396
1,100314.0,2021-11-04 12:35:35,2021-11-04 12:55:29,Smartphone,English,Form 1,Non-tablet HH,Mostly positive,,More excited than concerned,...,,Rep/Lean Rep,Moderate/Liberal Rep/Lean,"$40,000 to less than $50,000",You are ABSOLUTELY CERTAIN that you are regist...,Liberal,Several times a day,Yes,Middle income,0.31509
2,100363.0,2021-11-03 20:23:43,2021-11-03 20:36:24,Smartphone,English,Form 1,Non-tablet HH,Mostly positive,,Equally concerned and excited,...,,Dem/Lean Dem,Moderate/Conservative Dem/Lean,"$100,000 or more",You are ABSOLUTELY CERTAIN that you are regist...,Moderate,Several times a day,No,Upper income,0.829579
3,100598.0,2021-11-02 13:01:05,2021-11-04 12:37:42,Laptop/PC,English,Form 2,Non-tablet HH,,Mostly positive,Equally concerned and excited,...,,Rep/Lean Rep,Conservative Rep/Lean,"$100,000 or more",You are ABSOLUTELY CERTAIN that you are regist...,Conservative,Several times a day,Yes,Upper income,0.337527
4,100637.0,2021-11-02 12:32:58,2021-11-02 12:46:23,Laptop/PC,English,Form 2,Non-tablet HH,,Equal positive and negative effects,Equally concerned and excited,...,The Republican Party,Rep/Lean Rep,Conservative Rep/Lean,"$30,000 to less than $40,000",You are ABSOLUTELY CERTAIN that you are regist...,Very conservative,Less often,No,Lower income,1.210606


In [8]:
all_data.value_counts('F_EDUCCAT')

F_EDUCCAT
College graduate+        5223
Some College             3259
H.S. graduate or less    1746
Refused                    32
Name: count, dtype: int64
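To put those counts in proportion, here is a small sketch that recomputes them as shares. The numbers are copied from the output above and are unweighted; on the real frame, `all_data['F_EDUCCAT'].value_counts(normalize=True)` should give the same result.

```python
import pandas as pd

# Counts copied from the value_counts output above (unweighted).
counts = pd.Series(
    {
        "College graduate+": 5223,
        "Some College": 3259,
        "H.S. graduate or less": 1746,
        "Refused": 32,
    },
    name="count",
)

# Share of respondents in each education category.
shares = (counts / counts.sum()).round(3)
print(shares)
```

"Refused" accounts for only 32 of 10,260 respondents, which is why dropping it later loses very little.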

In [19]:
categories = [
  'F_AGECAT', 'F_GENDER', 'F_EDUCCAT', 
  'F_MARITAL',  
  'F_ATTEND', 
  'F_IDEO', 
  'F_INC_SDT1'
]
all_data[categories].head()

Unnamed: 0,F_AGECAT,F_GENDER,F_EDUCCAT,F_MARITAL,F_ATTEND,F_IDEO,F_INC_SDT1
0,65+,A man,College graduate+,Never been married,Seldom,Very conservative,"$50,000 to less than $60,000"
1,65+,A man,Some College,Divorced,Seldom,Liberal,"$40,000 to less than $50,000"
2,30-49,A woman,College graduate+,Married,A few times a year,Moderate,"$100,000 or more"
3,50-64,A woman,College graduate+,Married,Seldom,Conservative,"$100,000 or more"
4,65+,A woman,Some College,Married,Once or twice a month,Very conservative,"$30,000 to less than $40,000"


In [32]:
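# One-hot encode each categorical column into 0/1 indicator columns
data = pd.get_dummies(data=all_data[categories], columns=categories).astype("int")
# Drop the indicator columns for "Refused" answers (from the encoded frame, not all_data)
refused = [i for i in data.columns if "Refused" in i]
data = data.drop(refused, axis=1)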
data.head()


Unnamed: 0,F_AGECAT_18-29,F_AGECAT_30-49,F_AGECAT_50-64,F_AGECAT_65+,F_GENDER_A man,F_GENDER_A woman,F_GENDER_In some other way,F_EDUCCAT_College graduate+,F_EDUCCAT_H.S. graduate or less,F_EDUCCAT_Some College,...,F_IDEO_Very liberal,"F_INC_SDT1_$100,000 or more","F_INC_SDT1_$30,000 to less than $40,000","F_INC_SDT1_$40,000 to less than $50,000","F_INC_SDT1_$50,000 to less than $60,000","F_INC_SDT1_$60,000 to less than $70,000","F_INC_SDT1_$70,000 to less than $80,000","F_INC_SDT1_$80,000 to less than $90,000","F_INC_SDT1_$90,000 to less than $100,000","F_INC_SDT1_Less than $30,000"
0,0,0,0,1,1,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,1,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
2,0,1,0,0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,1,0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,1,0,1,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
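To make the encoding step easier to follow, here is a minimal sketch on a toy frame. The three rows are made up for illustration and only reuse two of the real column names. It shows what happens to a respondent who refused a question once the "Refused" indicator columns are dropped.

```python
import pandas as pd

# Hypothetical toy frame standing in for all_data[categories].
toy = pd.DataFrame(
    {
        "F_GENDER": ["A man", "A woman", "Refused"],
        "F_EDUCCAT": ["Some College", "Refused", "College graduate+"],
    }
)

# One indicator column per (question, answer) pair, as 0/1 integers.
encoded = pd.get_dummies(toy, columns=["F_GENDER", "F_EDUCCAT"]).astype("int")

# Drop the indicator columns for "Refused" answers.
refused_cols = [col for col in encoded.columns if "Refused" in col]
encoded = encoded.drop(columns=refused_cols)

print(encoded)
```

A respondent who refused a question keeps their row but ends up with zeros in every indicator column for that question, so the model sees "no answer" as the absence of all categories.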


In [2]:
# Split the dataset into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Further split the training set into train and validation sets.
# 0.25 x 0.8 = 0.2 of the full dataset, so the final split is 60/20/20.
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)
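As a quick sanity check on the arithmetic, the same two-step split can be run on a stand-in frame. The 1,000-row `demo` frame is hypothetical and only there to make the resulting sizes easy to read.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded frame: 1,000 rows, one column.
demo = pd.DataFrame({"x": range(1000)})

# Same two-step split as above: hold out 20% for test,
# then 25% of the remaining 80% for validation.
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)

print(len(train_df), len(val_df), len(test_df))
```

This gives 600 training rows, 200 validation rows and 200 test rows, with no row appearing in more than one set.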