# Data exploration 

---

Group name: Gruppe E

---


## Introduction

The data contain information on how Americans like their steak (well, medium-well, medium, medium-rare, or rare) and how willing they are to take risks. In addition, age, gender, location, whether the respondent smokes, and whether the respondent drinks alcohol were recorded. The data were collected, among other reasons, to investigate whether there is a relationship between steak doneness and risk tolerance.

## Setup

In [163]:
import pandas as pd
import altair as alt
from vega_datasets import data

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [164]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Import data

In [165]:
ROOT = "https://raw.githubusercontent.com/eo026/homework-1/main/data/external/"
DATA = "data.csv"

df = pd.read_csv(ROOT + DATA)

### Data structure

In [166]:
df

Unnamed: 0,Candidate,Gender,Race 1,Race 2,Race 3,Incumbent,Incumbent Challenger,State,Primary Date,Office,...,Runoff Outcome,Trump,Trump Date,Club for Growth,Party Committee,Renew America,E-PAC,VIEW PAC,Maggie's List,Winning for Women
0,"Aditya ""A.D."" Atholi",Male,Asian (Indian),,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
1,Joe McDaniel,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
2,Nathaniel Moran,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
3,John Porro,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
4,Dan Crenshaw,Male,White,,,Yes,No,Texas,3/1/22,Representative,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,Karen Testerman,Female,Asian (Japanese),White,,No,Yes,New Hampshire,9/13/22,Governor,...,,,,,No,,,,,
1595,Allen R. Waters,Male,Black,,,No,No,Rhode Island,9/13/22,Representative,...,,,,,No,,,,,
1596,Allan W. Fung,Male,Asian (Chinese),,,No,No,Rhode Island,9/13/22,Representative,...,,,,,Yes,,,,,
1597,Ashley Marie Kalus,Female,White,,,No,No,Rhode Island,9/13/22,Governor,...,,,,,,,,,,


In [167]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Candidate             1599 non-null   object
 1   Gender                1599 non-null   object
 2   Race 1                1599 non-null   object
 3   Race 2                30 non-null     object
 4   Race 3                1 non-null      object
 5   Incumbent             1599 non-null   object
 6   Incumbent Challenger  1599 non-null   object
 7   State                 1599 non-null   object
 8   Primary Date          1599 non-null   object
 9   Office                1599 non-null   object
 10  District              1214 non-null   object
 11  Primary Votes         1592 non-null   object
 12  Primary %             1592 non-null   object
 13  Primary Outcome       1599 non-null   object
 14  Runoff Votes          39 non-null     object
 15  Runoff %              39 non-null     

### Data corrections

In [168]:
df['Gender'] = df['Gender'].astype("category")
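Other columns that `df.info()` reports as `object` could be converted in a similar way. The following is a minimal sketch on a small made-up frame, not the notebook's actual data: the `%m/%d/%y` date format is inferred from values such as `3/1/22` shown in the table output, while the comma-separated vote strings are purely an assumption about how `Primary Votes` is stored. The name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the candidate table (all values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Primary Date": ["3/1/22", "9/13/22", "9/13/22"],
    "Primary Votes": ["1,024", "512", None],  # assumed string format
})

# Categorical dtype for a low-cardinality text column.
toy["Gender"] = toy["Gender"].astype("category")

# Parse month/day/two-digit-year strings into datetimes.
toy["Primary Date"] = pd.to_datetime(toy["Primary Date"], format="%m/%d/%y")

# Strip thousands separators, then convert; missing values become NaN.
toy["Primary Votes"] = pd.to_numeric(
    toy["Primary Votes"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy.dtypes)
```

If the real columns use a different format, the `format` string and the cleaning step would need to be adjusted accordingly.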

In [169]:
source = df[['Gender']]

In [170]:
source = source.reset_index()

In [171]:
source

Unnamed: 0,index,Gender
0,0,Male
1,1,Male
2,2,Male
3,3,Male
4,4,Male
...,...,...
1594,1594,Female
1595,1595,Male
1596,1596,Male
1597,1597,Female
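For a chart of gender shares, the quantity of interest is the number of rows per category rather than the row index. A minimal sketch of that aggregation with pandas, using a made-up `Gender` column in place of `df` (the names `toy` and `gender_counts` are hypothetical):

```python
import pandas as pd

# Made-up stand-in for df[['Gender']].
toy = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Male", "Female"]})

# One row per gender with its number of candidates.
gender_counts = (
    toy["Gender"]
    .value_counts()
    .rename_axis("Gender")
    .reset_index(name="count")
)

# Fraction of all candidates that falls into each gender.
gender_counts["share"] = gender_counts["count"] / gender_counts["count"].sum()

print(gender_counts)
```

A pre-aggregated table like this can be inspected directly, which makes it easier to check that a chart shows what it is supposed to show.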


## Exploratory data analysis

In [172]:
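# Pie chart of candidates per gender: count() sizes each slice by its
# number of rows (the row index itself carries no meaning).
chartRace = alt.Chart(source).mark_arc().encode(
    theta=alt.Theta("count()", type="quantitative"),
    color=alt.Color('Gender',
                    legend=alt.Legend(title="Gender")),
    tooltip=["Gender", "count()"]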
).properties( 
    title= 'Which gender are the candidates?',
    width= 300,
    height= 300

).configure_title(
    fontSize=12,
    font='Arial',
    anchor='start',
    color='black'
)

pie = chartRace.mark_arc(outerRadius=125)

pie

In [126]:
df['State'] = df['State'].astype("category")

In [128]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('State', 
            sort="-y",
            axis=alt.Axis(title="State", 
                          labelAngle=0)), 
    y=alt.Y('count(State)',              
            axis=alt.Axis(title = "Count")),
    color=alt.Color('State', scale=alt.Scale(scheme='pastel2'))        
).properties(
    title='Count of candidates in the states',
    width=400,
    height=250
)

In [173]:
df['State'] = df['State'].astype("category")

In [191]:
ChartState = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(State)',              
            axis=alt.Axis(title = "Count")),
    y=alt.Y('State', 
            axis=alt.Axis(title="State", 
                          labelAngle=0)), 
    color=alt.Color('State', scale=alt.Scale(scheme='pastel2')),
    tooltip = ["count(State)","State"]

).interactive(    

).properties(title='Count of candidates in the states',
             width=400,
             height=800
).configure_title(
    fontSize=12,
    font='Arial',
    anchor='middle',
    color='black'
)
ChartState

In [175]:
df['State'] = df['State'].astype("category")
df['Gender'] = df['Gender'].astype("category")

In [199]:
# Normalized stacked bars: share of candidates per gender within each state.
# Uses df, since source only holds the 'index' and 'Gender' columns.
alt.Chart(df).mark_bar().encode(
    x=alt.X('count()', stack="normalize"),
    y='State',
    color='Gender'
)
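The per-state gender shares that a normalized stacked bar is meant to display can also be tabulated directly. A minimal sketch with `pd.crosstab`, on a made-up frame that reuses the `State` and `Gender` column names of `df` (the names `toy` and `shares` are hypothetical):

```python
import pandas as pd

# Made-up stand-in for df[['State', 'Gender']].
toy = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Rhode Island", "Rhode Island"],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
})

# Rows are states, columns are genders; normalize="index" makes
# every row sum to 1, i.e. shares within each state.
shares = pd.crosstab(toy["State"], toy["Gender"], normalize="index")

print(shares)
```

Dropping `normalize="index"` would return the raw counts per state and gender instead of shares.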

In [204]:
alt.Chart(df).mark_circle().encode(
    x='count(Gender)',
    y='State',
    color='Gender',
    tooltip=["count(Gender)", "State"]
).interactive()