# Data exploration 

---

Group name: Gruppe E

---


## Introduction

The dataset contains information on Republican Party candidates in the United States. Each row describes one candidate by name, gender, race, and home state. It also records the office sought, the district, the primary date, the primary votes and outcome, and several endorsement columns (for example Trump and Club for Growth).

## Setup

In [4]:
import pandas as pd
import altair as alt
from vega_datasets import data

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [5]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Data

### Import data

In [6]:
ROOT = "https://raw.githubusercontent.com/eo026/homework-1/main/data/external/"
DATA = "data.csv"

df = pd.read_csv(ROOT + DATA)

### Data structure

In [7]:
df

Unnamed: 0,Candidate,Gender,Race 1,Race 2,Race 3,Incumbent,Incumbent Challenger,State,Primary Date,Office,...,Runoff Outcome,Trump,Trump Date,Club for Growth,Party Committee,Renew America,E-PAC,VIEW PAC,Maggie's List,Winning for Women
0,"Aditya ""A.D."" Atholi",Male,Asian (Indian),,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
1,Joe McDaniel,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
2,Nathaniel Moran,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
3,John Porro,Male,White,,,No,No,Texas,3/1/22,Representative,...,,,,,,,,,,
4,Dan Crenshaw,Male,White,,,Yes,No,Texas,3/1/22,Representative,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1594,Karen Testerman,Female,Asian (Japanese),White,,No,Yes,New Hampshire,9/13/22,Governor,...,,,,,No,,,,,
1595,Allen R. Waters,Male,Black,,,No,No,Rhode Island,9/13/22,Representative,...,,,,,No,,,,,
1596,Allan W. Fung,Male,Asian (Chinese),,,No,No,Rhode Island,9/13/22,Representative,...,,,,,Yes,,,,,
1597,Ashley Marie Kalus,Female,White,,,No,No,Rhode Island,9/13/22,Governor,...,,,,,,,,,,


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Candidate             1599 non-null   object
 1   Gender                1599 non-null   object
 2   Race 1                1599 non-null   object
 3   Race 2                30 non-null     object
 4   Race 3                1 non-null      object
 5   Incumbent             1599 non-null   object
 6   Incumbent Challenger  1599 non-null   object
 7   State                 1599 non-null   object
 8   Primary Date          1599 non-null   object
 9   Office                1599 non-null   object
 10  District              1214 non-null   object
 11  Primary Votes         1592 non-null   object
 12  Primary %             1592 non-null   object
 13  Primary Outcome       1599 non-null   object
 14  Runoff Votes          39 non-null     object
 15  Runoff %              39 non-null     

## Visualization 1

### Data corrections

In [9]:
df['Gender'] = df['Gender'].astype("category")

### Exploratory data analysis

In [10]:
source = df[['Gender']]

In [11]:
source = source.reset_index()

In [12]:
source

Unnamed: 0,index,Gender
0,0,Male
1,1,Male
2,2,Male
3,3,Male
4,4,Male
...,...,...
1594,1594,Female
1595,1595,Male
1596,1596,Male
1597,1597,Female


In [31]:
chartGender = alt.Chart(source).mark_arc().encode(
    # Size each slice by the number of candidates per gender, not by the row index
    theta=alt.Theta("count()", type="quantitative"),
    color=alt.Color('Gender',
                    legend=alt.Legend(title="Gender")),
    tooltip=["Gender", "count()"]
).properties(
    title='Which gender are the candidates?',
    width=300,
    height=300
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='start',
    color='black'
)

pie = chartGender.mark_arc(outerRadius=125)

pie

Explanation for the choice of this visualization type:

I have chosen a pie chart because it makes the proportion of male and female Republican candidates clear at first glance. It is immediately apparent that the proportion of male candidates is far higher than the proportion of female candidates.
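
The proportions shown in the pie chart can also be checked numerically. The following is a minimal sketch that assumes a small made-up `sample` frame in place of the real `df`; the values are illustrative only and are not results from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `Gender` column of `df` (not the real data)
sample = pd.DataFrame({"Gender": ["Male", "Male", "Male", "Female"]})

# Absolute number of candidates per gender
gender_counts = sample["Gender"].value_counts()

# Share of each gender; the shares sum to 1
gender_share = sample["Gender"].value_counts(normalize=True)

print(gender_counts)
print(gender_share)
```

Replacing `sample` with `df` should give the shares that the pie slices represent.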

## Visualization 2

### Data corrections

In [16]:
df['State'] = df['State'].astype("category")

### Exploratory data analysis

In [32]:
ChartState = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(State)',              
            axis=alt.Axis(title = "Count")),
    y=alt.Y('State',
            axis=alt.Axis(title="State", 
                          labelAngle=0)), 
    color=alt.Color('State', scale=alt.Scale(scheme='set3')),
    tooltip = ["count(State)","State"]

).interactive(    

).properties(title='Count of candidates in the states',
             width=400,
             height=800
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='black'
)
ChartState

Explanation for the choice of this visualization type:

I have chosen a horizontal bar chart because it shows the number of candidates per state clearly. The number of candidates can be read on the x-axis and the states are listed on the y-axis. Because of the large number of states, only a horizontal layout was practical: on a vertical bar chart, there would not have been enough space on the x-axis to show all state labels legibly.
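
The bar lengths correspond to a simple count of rows per state. As a rough sketch, and assuming a made-up `sample` frame instead of the real `df`, the same numbers could be computed and ranked with pandas:

```python
import pandas as pd

# Hypothetical stand-in for the `State` column of `df` (not the real data)
sample = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas",
              "Rhode Island", "Rhode Island",
              "New Hampshire"]
})

# Number of candidates per state, largest first
state_counts = (
    sample.groupby("State")
    .size()
    .sort_values(ascending=False)
)

print(state_counts)
```

A table sorted like this makes it easier to name the states with the most candidates than reading them off an alphabetically ordered axis.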

## Visualization 3

### Data corrections

In [18]:
df['State'] = df['State'].astype("category")
df['Gender'] = df['Gender'].astype("category")

### Exploratory data analysis

In [19]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('count(Gender)', stack="normalize",
        axis=alt.Axis(title = "Proportion of male and female candidates")),
    y=alt.Y('State',
        axis=alt.Axis(title = "State")),
    color='Gender',
    tooltip=["Gender", "count(Gender)", "State"]
).interactive(
).properties(title='Proportion of male and female candidates in the states',
             width=400,
             height=800
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='black'
)

Explanation for the choice of this visualization type:

On the normalized stacked bar chart, the percentage of male and female candidates per state can be seen at a glance. Moreover, this visualization makes the states comparable, because every bar has the same length regardless of how many candidates a state has. It becomes clear how the gender proportions of the candidates differ between the states.
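
What `stack="normalize"` does per bar can be reproduced as a table of row-wise shares. The sketch below assumes a made-up `sample` frame rather than the real `df`, so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the `State` and `Gender` columns of `df`
sample = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Texas",
              "Rhode Island", "Rhode Island"],
    "Gender": ["Male", "Male", "Male", "Female",
               "Male", "Female"],
})

# Share of each gender within each state; every row sums to 1
gender_share_by_state = pd.crosstab(
    sample["State"], sample["Gender"], normalize="index"
)

print(gender_share_by_state)
```

Each row of the result corresponds to one bar in the chart, and each column to one colored segment.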

## Visualization 4

### Exploratory data analysis

In [20]:
alt.Chart(df).mark_circle().encode(
    x=alt.X('count(Gender)',
        axis=alt.Axis(title = "Count")),
    y=alt.Y('State',
        axis=alt.Axis(title = "State")),
    color='Gender',
    tooltip=["Gender", "count(Gender)", "State"]
).interactive(
    
).properties(
    title='Count of candidates in the states by gender',

).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='black'
)

Explanation for the choice of this visualization type:

In contrast to the normalized stacked bar chart, the scatter plot shows the absolute number of male and female Republican candidates. I have additionally chosen this visualization because each state gets one point per gender, so the number of male and female candidates can be read directly from the x-axis. The total number of candidates per state follows from adding the two values.
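
The absolute counts per state and gender, together with the totals, can be laid out in one table. This is a minimal sketch that assumes a made-up `sample` frame in place of the real `df`; the label `"Total"` is my own choice for the margin name.

```python
import pandas as pd

# Hypothetical stand-in for the `State` and `Gender` columns of `df`
sample = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas", "Texas",
              "Rhode Island", "Rhode Island"],
    "Gender": ["Male", "Male", "Male", "Female",
               "Male", "Female"],
})

# Absolute counts per state and gender, plus row and column totals
gender_counts_by_state = pd.crosstab(
    sample["State"], sample["Gender"],
    margins=True, margins_name="Total"
)

print(gender_counts_by_state)
```

The `Male` and `Female` columns match the two points plotted per state, while the `Total` column gives the overall number of candidates in that state.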

## Visualization 5

### Data corrections

In [27]:
df['Race 1'] = df['Race 1'].astype("category")

### Exploratory data analysis

In [30]:
alt.Chart(df).mark_bar().encode(
    x=alt.X('count(Race 1)',
        axis=alt.Axis(title = "Count")),
    y=alt.Y('State',
        axis=alt.Axis(title = "State")),
    color='Race 1',
    tooltip=["Race 1", "count(Race 1)", "State"]
).interactive(
).properties(title='Count of candidates in the states by race',
             width=400,
             height=800
).configure_title(
    fontSize=15,
    font='Arial',
    anchor='middle',
    color='black'
)

Explanation for the choice of this visualization type:

I chose the horizontal stacked bar chart because the colored segments of each bar show how the candidates are distributed across races (based on the `Race 1` column), while the total length of the bar shows the number of candidates per state. It is easy to compare which state has the most candidates and which race is most represented in each state.
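
Both comparisons can also be made in a table. The following sketch assumes a made-up `sample` frame instead of the real `df`; the rows are invented for illustration and say nothing about the actual distribution in the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `State` and `Race 1` columns of `df`
sample = pd.DataFrame({
    "State": ["Texas", "Texas", "Texas",
              "Rhode Island", "Rhode Island", "Rhode Island"],
    "Race 1": ["White", "White", "Asian (Indian)",
               "Black", "Asian (Chinese)", "Asian (Chinese)"],
})

# Counts per state (rows) and race (columns); one row per stacked bar
race_counts = pd.crosstab(sample["State"], sample["Race 1"])

# Total bar length per state
candidates_per_state = race_counts.sum(axis=1)

# Race with the largest segment in each state
top_race_per_state = race_counts.idxmax(axis=1)

print(candidates_per_state)
print(top_race_per_state)
```

Note that `idxmax` returns only the first label when two races are tied within a state, so ties would need a separate check.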