# Plotly Basics — Introduction to Data Visualization

In this lesson, we'll learn how to create interactive visualizations with Plotly, an open-source Python library whose charts support hovering, zooming, and panning out of the box. We'll work with the 1880 Utah Census dataset to visualize historical demographic information.


## Setting up Plotly

First, we need to import the necessary libraries. We'll use Pandas for data manipulation and Plotly Express, which provides a simplified API for creating Plotly visualizations.


In [28]:
# Import libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go            # Used later for go.Figure, go.Box, go.Pie, go.Bar
from plotly.subplots import make_subplots    # Used later for the side-by-side subplots

# Set pandas display options
pd.options.display.max_rows = 10

## Loading the Utah Census Data

Now, let's load the 1880 Utah Census dataset. This historical dataset contains demographic information about Utah residents from the 1880 census.


In [29]:
# Load the census data
utah_df = pd.read_csv('utah-census-1880.csv')

# Take a quick look at the data
utah_df.head()

Unnamed: 0,key,page,line,district,location,county,household,family,lastname,firstname,...,deaf,idiotic,insane,maimed,attendschool,notread,notwrite,birthstate,fbirthplace,mbirthplace
0,0212A_2_NEWBOLD_29,0212A,2,3072,Smithfield,CACHE,9157,158,NEWBOLD,SAMUEL,...,n,n,n,n,n,n,n,ENGLAND,ENGLAND,ENGLAND
1,0163A_45_CHRISTIANSEN_37,0163A,45,2560,Hyrum,CACHE,150,151,CHRISTIANSEN,MARIAH,...,n,n,n,n,n,n,n,DENMARK,DENMARK,DENMARK
2,0120A_1_JENSEN_71,0120A,1,2048,Logan City,CACHE,1,1,JENSEN,OLEY,...,n,n,n,n,n,n,n,SWEDEN,SWEDEN,SWEDEN
3,0120A_2_JENSEN_69,0120A,2,2048,Logan City,CACHE,1,1,JENSEN,INGER,...,n,n,n,n,n,n,y,SWEDEN,SWEDEN,SWEDEN
4,0120A_3_ENGBER_6,0120A,3,2048,Logan City,CACHE,1,1,ENGBER,JOSEPH,...,n,n,n,n,n,n,n,UTAH,SWEDEN,SWEDEN


Each row describes one person. The preview shows names, locations, and birthplaces; the `...` marks columns hidden from view, which include the age, sex, and occupation fields used later in this lesson.


In [30]:
# Count the number of people in each county
county_counts = utah_df['county'].value_counts()

# Display the counts
county_counts


county
SALT LAKE    31989
UTAH         17967
CACHE        12562
WEBER        12358
SANPETE      11538
             ...  
PIUTE         1616
UINTAH         799
EMERY          556
RICH           281
SAN JUAN       204
Name: count, Length: 23, dtype: int64

In [31]:
# Apply reset_index() to convert the index (county names) into a regular column
county_counts = county_counts.reset_index()
county_counts.columns = ['county', 'count']

# Display the counts after reset_index
county_counts

Unnamed: 0,county,count
0,SALT LAKE,31989
1,UTAH,17967
2,CACHE,12562
3,WEBER,12358
4,SANPETE,11538
...,...,...
18,PIUTE,1616
19,UINTAH,799
20,EMERY,556
21,RICH,281
22,SAN JUAN,204
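The two cells above rely on `value_counts()` returning a Series whose index holds the category names. The sketch below repeats the idea on a tiny made-up DataFrame (`people` and `counts_df` are illustrative names, not part of the census data) and names both columns in a single chained step instead of assigning `.columns` afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the census data: one row per person
people = pd.DataFrame({"county": ["CACHE", "WEBER", "CACHE", "UTAH", "CACHE", "WEBER"]})

# value_counts() returns a Series: county names are the index, counts are the values
counts = people["county"].value_counts()

# rename_axis() names the index and reset_index(name=...) names the values column,
# so both column names are set explicitly in one chained expression
counts_df = counts.rename_axis("county").reset_index(name="count")

print(counts_df)
```

Plotly Express expects column names for `x` and `y`, which is why the counts are converted from a Series to a DataFrame before plotting.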


In [32]:
# Create a simple bar chart
fig = px.bar(county_counts, x='county', y='count')

# Show the figure
fig.show()


In [33]:
# Add title and customize axis labels
fig = px.bar(
    county_counts,
    x='county',
    y='count',
    title='Population by County in Utah (1880)',  # Add a title
    labels={'county': 'County', 'count': 'Population'}  # Rename axis labels
)

# Display the chart
fig.show()

In [34]:
# Add color and use a cleaner template
fig = px.bar(
    county_counts,
    x='county',
    y='count',
    title='Population by County in Utah (1880)',
    labels={'county': 'County', 'count': 'Population'},
    color='count',                     # Color bars by population
    color_continuous_scale='Inferno_r',  # Use a reversed color scale
    template='plotly_white'            # Use a clean white template
)

# Display the chart
fig.show()

In [35]:
# Update layout with additional customizations
fig.update_layout(
    xaxis_title='County',                  # Customize x-axis title
    yaxis_title='Number of People',        # Customize y-axis title
    xaxis_tickangle=-45,                   # Rotate x-axis labels 45 degrees
    height=500,                            # Set chart height in pixels
    width=800,                             # Set chart width in pixels
    title_font=dict(size=22),              # Change title font size
    plot_bgcolor='white',  # Set plot background color
    margin=dict(l=40, r=40, t=80, b=80),   # Adjust margins (left, right, top, bottom)
    coloraxis_colorbar=dict(title='Population'),  # Set the color bar title (continuous colors use a color bar, not a legend)
    hoverlabel=dict(                       # Customize hover label appearance
        bgcolor="white",
        font_size=12,
        font_family="Arial")
)

# Display the chart
fig.show()

In [36]:
# First, let's prepare our data
# Count the frequency of each occupation, reset the index, and keep the top 10
top_occupations = utah_df['occupation'].value_counts().reset_index()
top_occupations.columns = ['occupation', 'count']
top_occupations = top_occupations.head(10)

# Display the prepared data
top_occupations

Unnamed: 0,occupation,count
0,KEEPING HOUSE,23031
1,AT HOME,13743
2,FARMER,9535
3,LABORER,6846
4,AT SCHOOL,6549
5,MINER,2596
6,WORKS ON FARM,2026
7,CARPENTER,1225
8,SERVANT,1219
9,FARM LABORER,1065


In [37]:
# Create the bar chart, then enhance its appearance with update_layout()
fig = px.bar(
    top_occupations,
    x='occupation',
    y='count',
    color='count',  # Color bars by count
    title='Top 10 Occupations in Utah (1880)',
    labels={'occupation': 'Occupation', 'count': 'Number of People'},
    color_continuous_scale='Viridis',
    template='plotly_dark'
)

# Update layout with additional customizations
fig.update_layout(
    xaxis_tickangle=-45,  # Rotate x-axis labels 45 degrees
    height=400,  # Set chart height in pixels
    width=800,  # Set chart width in pixels
    title_font=dict(size=30),  # Change title font size to 30
    margin=dict(l=80, r=80, t=100, b=100),  # Different margins
    coloraxis_colorbar=dict(title='Number of People'),  # Set the color bar title
    hoverlabel=dict(  # Customize hover label appearance
        bgcolor="black",
        font_size=14,
        font_family="Courier New"
    )
)

# Display the chart
fig.show()

In [67]:
# See which occupations appear in the data and how often
utah_df['occupation'].value_counts()

occupation
KEEPING HOUSE           23031
AT HOME                 13743
FARMER                   9535
LABORER                  6846
AT SCHOOL                6549
                        ...  
CLERK IN SURVEYOR           1
NIGHTWATCHMAN               1
"ARCHITECT, BUILDER"        1
ACTRESS                     1
R.R. FIREMAN                1
Name: count, Length: 3320, dtype: int64

This pie chart shows the proportion of males (M) and females (F) recorded in the 1880 Utah census.


## Occupational Distribution

Now, let's explore the most common occupations in the 1880 Utah census:


In [43]:
# Get the top 10 most common occupations (excluding blank entries)
occupation_counts = utah_df['occupation'].value_counts().reset_index()
occupation_counts.columns = ['Occupation', 'Count']
occupation_counts = occupation_counts[occupation_counts['Occupation'] != ''].head(10)

# Create a horizontal bar chart
fig = px.bar(occupation_counts, y='Occupation', x='Count', 
            title='Top 10 Occupations in 1880 Utah Census',
            color='Count',
            color_continuous_scale='Viridis',
            orientation='h')

# Update the layout
fig.update_layout(
    yaxis=dict(autorange="reversed"),  # This puts the largest value at the top
    height=500,
    width=700
)

# Display the plot
fig.show()

This horizontal bar chart displays the 10 most common occupations in the 1880 Utah census. The bars are sorted with the most common occupation at the top, and the color intensity indicates the count as well.


## Birthplace Analysis

Let's visualize where people in Utah were born:


In [None]:
# Get the top birthplaces (excluding blank entries)
# value_counts() sorts by frequency, so head(10) keeps the 10 most common birthplaces
birthplace_counts = utah_df['birthstate'].value_counts().reset_index()
birthplace_counts.columns = ['Birthplace', 'Count']
birthplace_counts = birthplace_counts[birthplace_counts['Birthplace'] != ''].head(10)

# Create a bar chart
fig = px.bar(birthplace_counts, x='Birthplace', y='Count',
            title='Top 10 Birth Places in 1880 Utah Census',
            color='Count',
            text='Count',
            color_continuous_scale='Teal')

# Update the layout
fig.update_layout(
    xaxis_title="Birthplace",
    yaxis_title="Number of People",
    xaxis_tickangle=-45
)

# Display the plot
fig.show()

This bar chart shows the 10 most common birthplaces of Utah residents in 1880. The `text='Count'` argument prints the exact count as a label on each bar.


## Creating a Scatter Plot

Now, let's create a scatter plot to explore whether there is a relationship between a person's age and their birth month:


In [11]:
# Create a sample of the data (to make the plot more manageable)
census_sample = utah_df[utah_df['birthmonth'] > 0].sample(1000, random_state=42)

# Create a scatter plot
fig = px.scatter(census_sample, x='age', y='birthmonth',
                title='Age vs Birth Month in 1880 Utah Census (Sample)',
                color='sex',
                size='age',
                hover_name='firstname',
                hover_data=['lastname', 'occupation'],
                opacity=0.7)

# Update the layout
fig.update_layout(
    xaxis_title="Age",
    yaxis_title="Birth Month (1-12)",
    yaxis=dict(
        tickmode='array',
        tickvals=list(range(1, 13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    )
)

# Display the plot
fig.show()

This scatter plot shows the relationship between age and birth month for a sample of the census data, with points colored by gender and sized by age. Try hovering over points to see detailed information about each person.
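The sampling step above asks for exactly 1,000 rows, and `sample()` raises a `ValueError` if fewer rows pass the filter. A defensive variant caps the request at the number of available rows. The sketch uses a made-up `filtered` DataFrame in place of the filtered census rows.

```python
import pandas as pd

# Hypothetical stand-in for the filtered census rows (250 rows only)
filtered = pd.DataFrame({
    "age": list(range(250)),
    "birthmonth": [(i % 12) + 1 for i in range(250)],
})

# Never request more rows than exist; sample() raises ValueError otherwise
n_rows = min(1000, len(filtered))
census_sample = filtered.sample(n=n_rows, random_state=42)

print(len(census_sample))
```

Fixing `random_state` keeps the sample reproducible, so the plot looks the same each time the cell runs.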


## Creating Multiple Subplots

Plotly also allows us to create complex visualizations with multiple subplots. Let's create a figure with two visualizations side by side:


In [12]:
# Create a figure with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, 
                    subplot_titles=("Age Distribution by Gender", "Marital Status Distribution"),
                    specs=[[{"type": "box"}, {"type": "pie"}]])

# Add box plot of ages by gender
for gender in ['M', 'F']:
    gender_data = utah_df[utah_df['sex'] == gender]
    fig.add_trace(
        go.Box(y=gender_data['age'], name=gender),
        row=1, col=1
    )

# Add pie chart of marital status
marital_counts = utah_df['marrystatus'].value_counts()
fig.add_trace(
    go.Pie(labels=marital_counts.index, values=marital_counts.values),
    row=1, col=2
)

# Update the layout
fig.update_layout(
    title_text="Census Data Analysis",
    height=500,
    width=900
)

# Display the plot
fig.show()

This creates a figure with two plots:

- A box plot showing age distribution by gender
- A pie chart showing marital status distribution

Note that we're using the `make_subplots` function to create the layout, and then adding individual traces to specific subplot positions.


## Customizing Your Plots

Let's create one more visualization with more customization options:


In [None]:
# Calculate average age by occupation (for top 15 occupations with at least 100 people)
occupation_counts = utah_df['occupation'].value_counts()
common_occupations = occupation_counts[occupation_counts >= 100].index.tolist()
common_occupations = [occ for occ in common_occupations if occ != ''][:15]

# Filter the data
occupation_age_data = utah_df[utah_df['occupation'].isin(common_occupations)]
occupation_age = occupation_age_data.groupby('occupation')['age'].mean().reset_index()
occupation_age = occupation_age.sort_values('age', ascending=False)

# Create a horizontal bar chart
fig = go.Figure()

# Add a bar trace
fig.add_trace(
    go.Bar(
        y=occupation_age['occupation'],
        x=occupation_age['age'],
        orientation='h',
        marker=dict(
            color=occupation_age['age'],
            colorscale='Viridis',
            colorbar=dict(title="Average Age")
        )
    )
)

# Update the layout
fig.update_layout(
    title="Average Age by Occupation in 1880 Utah Census",
    xaxis_title="Average Age",
    yaxis_title="Occupation",
    yaxis=dict(autorange="reversed"),
    height=600,
    width=800,
    template="plotly_white"
)

# Display the plot
fig.show()

This creates a horizontal bar chart showing the average age for different occupations, with bars colored by average age, a labelled color bar, and the `plotly_white` template.

Note that in this example, we're using the lower-level `go.Figure()` approach instead of Plotly Express, which gives us more fine-grained control over the visualization.


## Saving Plotly Visualizations

Finally, let's see how to save our visualizations:


In [None]:
# Save as a static image (requires kaleido package)
# fig.write_image("occupation_age.png")

# Save as an interactive HTML file
fig.write_html("occupation_age.html")

The `write_html` call saves the visualization as an interactive HTML file that you can open in a web browser. The commented-out `write_image` call would save a static PNG instead; it requires the kaleido package (`pip install kaleido`).


## Exercise: Create Your Own Visualization

Try to create a visualization on your own using the Utah census data. Here are some ideas:

- Create a histogram of family sizes (the 'family' column looks like a family number rather than a size, so count the rows per family first)
- Visualize literacy rates (using 'notread' and 'notwrite' columns) by age group
- Create a scatter plot comparing age and occupation
- Visualize the distribution of people who attended school ('attendschool')


In [None]:
# Your code here



## Conclusion

In this tutorial, you've learned the basics of creating interactive visualizations with Plotly. We've covered:

- Creating basic plots (bar charts, scatter plots, box plots, pie charts)
- Customizing plot appearance
- Working with subplots
- Saving visualizations

Plotly is a powerful library with many more features than we've explored here. For more information, check out the [Plotly Python Documentation](https://plotly.com/python/).

Remember that good data visualization helps tell a story about your data, making patterns and insights more accessible. As you continue to work with data, developing your visualization skills will be crucial for both your analysis and communication of results.
