# Exploring Data Sets and Formulating a Hypothesis

In this section, you will explore the provided data sets to understand their structure and content. Follow the steps below to guide your exploration and hypothesis formulation:

1. **Load the Data Sets**: Begin by loading the data sets into your environment. Use appropriate libraries such as `pandas` to read the data files.

2. **Inspect the Data**: Examine the first few rows of each data set using functions like `head()`. Check for missing values, data types, and overall structure.

3. **Summary Statistics**: Generate summary statistics for numerical columns using `describe()`. This will give you an overview of the central tendency, dispersion, and shape of the data distribution.

4. **Data Visualization**: Create visualizations to identify patterns, trends, and outliers. Use libraries such as `matplotlib` and `seaborn` to create plots like histograms, scatter plots, and box plots.

5. **Identify Relationships**: Look for relationships between different variables. Use correlation matrices and pair plots to identify potential connections.

6. **Formulate a Hypothesis**: Based on your exploration, come up with a hypothesis that you can test. A hypothesis is a statement that can be tested by further analysis and experimentation.

7. **Document Your Findings**: Keep detailed notes of your observations and the steps you took during your exploration. This documentation will be valuable for future reference and for communicating your findings to others.
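Steps 2 through 5 can be sketched on a small synthetic table. This is only an illustration under stated assumptions: the `toy` frame, its `temp` and `riders` columns, and all values are invented here and are not part of the provided data sets.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: column names and values are invented for illustration.
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(15, 6, n)                         # hypothetical daily temperature
riders = 1500 + 80 * temp + rng.normal(0, 300, n)   # hypothetical daily rider count
toy = pd.DataFrame({"temp": temp, "riders": riders})
toy.loc[::50, "temp"] = np.nan                      # inject a few missing values

# Step 2: inspect the first rows, the dtypes, and the missing values per column.
print(toy.head())
print(toy.dtypes)
missing = toy.isna().sum()
print(missing)

# Step 3: summary statistics for the numerical columns.
summary = toy.describe()
print(summary)

# Step 4: a histogram of one variable and a scatter plot of two.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(toy["riders"], bins=20)
ax_hist.set_xlabel("riders")
ax_scatter.scatter(toy["temp"], toy["riders"], s=8)
ax_scatter.set_xlabel("temp")
ax_scatter.set_ylabel("riders")
fig.tight_layout()

# Step 5: pairwise Pearson correlations (missing values are excluded pairwise).
corr = toy.corr()
print(corr)
```

On the real data, the same calls apply to the loaded DataFrames, with the column names replaced by the ones found during inspection.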

By following these steps, you will gain a thorough understanding of the data sets and be well-prepared to formulate a meaningful hypothesis.

**Goal**: From your exploratory data analysis (EDA), formulate one or more hypotheses that you would like to test in a future analysis. Each hypothesis should be based on the patterns and relationships you observed during your exploration.

## Imports and loading data


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
plt.rcParams.update({'font.family': 'cmr10',
                     'font.size': 12,
                     'axes.unicode_minus': False,
                     'axes.labelsize': 12,
                     'figure.figsize': (3, 3),
                     'figure.dpi': 80,
                     'mathtext.fontset': 'cm',
                     'mathtext.rm': 'serif',
                     'xtick.direction': 'in',
                     'ytick.direction': 'in',
                     'xtick.top': True,
                     'ytick.right': True
                     })

In [3]:
# Load the Fremont Bridge bicycle counts and the NOAA Seattle weather data.
bike = pd.read_csv("../../data/fremont-bridge-bicycle-counts-exercise 2.csv")
weather = pd.read_csv("../../data/NOAA_Seattle-data 2.csv")
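After loading, a quick per-column overview helps confirm that dates were parsed and shows where values are missing. The sketch below is a minimal example: it reads a tiny in-memory CSV instead of the real files, the `Date` and `Total` column names are hypothetical, and `overview` is a helper defined here for illustration, not a pandas function.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the "Date" and "Total" column names are hypothetical.
csv_text = """Date,Total
2020-01-01,120
2020-01-02,
2020-01-03,310
"""

# parse_dates converts the named column to a datetime dtype while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

def overview(df):
    """Return one row per column: dtype, missing-value count, unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

report = overview(sample)
print(sample.shape)
print(report)
```

The same helper could be called as `overview(bike)` and `overview(weather)`. If a date column was read as plain text, `pd.to_datetime` can convert it after loading.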

**NOTE**: Not all of the columns in the weather data are easily interpretable. The following data dictionary may help you understand the data better:

- AWND: Average daily wind speed.
- PGTM: Peak gust time, the time of the peak wind gust.
- PRCP: Precipitation, the total amount of precipitation for the day.
- SNOW: Snowfall, the total amount of snowfall for the day.
- SNWD: Snow depth, the depth of snow on the ground.
- TAVG: Average temperature for the day.
- TMAX: Maximum temperature for the day.
- TMIN: Minimum temperature for the day.
- WDF2: Direction of the fastest 2-minute wind, in degrees.
- WDF5: Direction of the fastest 5-second wind, in degrees.
- WSF2: Fastest 2-minute wind speed.
- WSF5: Fastest 5-second wind speed.
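One way to move from the data dictionary toward a hypothesis is to join the two tables on their date and compare ridership against a few weather variables. The following is a sketch on synthetic data, not a result: `PRCP` and `TMAX` come from the dictionary above, while the `DATE` and `riders` column names, the `weather_toy` and `bike_toy` frames, and every value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins. PRCP and TMAX are named in the data dictionary;
# "DATE", "riders", and all values are made up for this example.
dates = pd.date_range("2021-01-01", periods=120, freq="D")
rng = np.random.default_rng(1)
weather_toy = pd.DataFrame({
    "DATE": dates,
    "PRCP": np.where(rng.random(120) < 0.4, rng.exponential(0.3, 120), 0.0),
    "TMAX": rng.normal(55, 10, 120),
})
bike_toy = pd.DataFrame({
    "DATE": dates,
    "riders": (2000 + 30 * weather_toy["TMAX"] - 1500 * weather_toy["PRCP"]
               + rng.normal(0, 200, 120)),
})

# Join the two tables on the shared date column.
merged = bike_toy.merge(weather_toy, on="DATE", how="inner")

# Correlation of ridership with each weather variable.
corr_with_riders = merged[["riders", "PRCP", "TMAX"]].corr()["riders"].drop("riders")
print(corr_with_riders)

# Compare mean ridership on dry days and wet days.
merged["day_type"] = np.where(merged["PRCP"] > 0, "wet", "dry")
mean_by_type = merged.groupby("day_type")["riders"].mean()
print(mean_by_type)
```

A pattern like the one built into this toy data would suggest a testable statement such as "daily ridership is lower on days with precipitation". Whether the real data support it is left for the later analysis.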