Specification for Project 6 EDA Notebook

Overview

Project 6 is an opportunity to create your own custom exploratory data analysis (EDA) project using GitHub, Git, Jupyter, pandas, Seaborn and other popular data analytics tools.

Deliverable Names

GitHub Repository: datafun-06-eda
Documentation: README.md
Notebook: yourname_eda.ipynb

External Dependencies

This project will likely use at least the following external modules, so a virtual environment is recommmended. You may adjust packages used as needed.

jupyterlab
pandas
pyarrow
matplotlib
seaborn

Version Control with Git

Create a new GitHub repository named datafun-06-eda.
Clone the repository to your local machine.
Document the steps and commands in your README.md.
Document your workflow and commands as you edit, add, commit, and push changes to the GitHub repository.

Objective

Perform and publish a custom EDA project to demnostrate skills with Jupyter, pandas, Seaborn and popular tools for data analytics. The notebook should tell a data story and visually present findings in a clear and engaging manner.

Explore Datasets

Choose a dataset for analysis. You will want a known, clean dataset. Cleaning data can run 60-80% of the project (or more) - you don't need to demonstrate cleaning skills for this project.
The recommneded approach is to select one of the other pre-installed datasets in Seaborn. You can view a list of the Seaborn datasets in the first link below. The additional links offer a range of options.

You may use your own data if you have permission and there is no confidential information included. Be careful with your data selection and ensure you have rights to use the content.

Requirements

1. Environment Setup

Create and activate the project virtual environment.
Install all required packages into your local project virtual environment.
After installing the required dependencies, update or generate a requirements.txt file.
Add a .gitignore file to your project with useful entries. See .gitignore example.
Document the steps and commands in your README.md.

2. Project Start

Make sure Jupyter is installed and working in your project virtual environment. Document the process and commands you used in your README.md.

Then create, open, and start a new notebook in your root project repository folder:

Create the Notebook: In the VS Code Explorer, create a new file i.e., yourname_eda.ipynb. Ensure it has a .ipynb extension.
Verify your new notebook is open for editing. If needed, view the project files in VS Code Explorer and double-click the notebook file to open it for editing.
Add a Markdown cell at the top of your notebook with the introduction (include the title, author, date and the purpose of the project).

3. Import Dependencies (At the Top, After the Introduction)

Add a Python cell next with the import statements for the libraries you will use in the project. Follow conventional package import organization and alias. Import each package just once near the top of the file. Be sure you have INSTALLED any external packages (outside the Python Standard Library) into your active project virtual environment first.

Jupyter Notebook / Python cell example:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Execute the cell to ensure everything works. If you get errors on one of the statements above, the most common issue is that package has not been installed into the active project virtual environment. When you find you need a new package, first install it into the active project virtual environment and then import it near the top of your Python or Notebook file.

5. Exploratory Data Analysis

Perform a unique exploratory data analysis project using the tools and skills covered previously.

Step 1. Data Acquisition

Load the data into a pandas DataFrame. Use the pd read functions such as pd.read_csv() or pd.read_excel() as appropriate. To read from the Seaborn dataset, we'll use sns.load_dataset() function and pass in the 'iris' (the name without .csv) to populate our DataFrame.

Jupyter Notebook / Python cell example:

# Load the dataset into a pandas DataFrame - adjust this process for your custom data
df = sns.load_dataset('iris')

# Inspect first rows of the DataFrame
print(df.head())

Step 2. Initial Data Inspection

Display the first 10 rows of the DataFrame, check the shape, and display the data types of each column using df.head(10), df.shape, and df.dtypes.

Jupyter Notebook / Python cell example:

print(df.head(10))
print(df.shape)
print(df.dtypes)

Step 3. Initial Descriptive Statistics

Use the DataFrame describe() method to display summary statistics for each column.

Jupyter Notebook / Python cell example:

print(df.describe())

Step 4. Initial Data Distribution for Numerical Columns

Choose a numerical column and use df['column_name'].hist() to plot a histogram for that specific column. To show all the histograms for all numerical columns, use df.hist().

Jupyter Notebook / Python cell example:

# Inspect histogram by numerical column
df['sepal_length'].hist()

# Inspect histograms for all numerical columns
df.hist()

# Show all plots
plt.show()

Afterwards, use a Markdown cell to document your observations.

Step 5. Initial Data Distribution for Categorical Columns

Choose a categorical column and use df['column_name'].value_counts() to display the count of each category. Use a loop to show the value counts for all categorical columns.

Jupyter Notebook / Python cell example:

# Inspect value counts by categorical column
df['species'].value_counts()

# Inspect value counts for all categorical columns
for col in df.select_dtypes(include=['object', 'category']).columns:
    # Display count plot
    sns.countplot(x=col, data=df)
    plt.title(f'Distribution of {col}')
    plt.show()

# Show all plots
plt.show()

Afterwards, use a Markdown cell to document your observations.

Step 6. Initial Data Transformation and Feature Engineering

Use pandas and other tools to perform transformations. Transformation may include renaming columns, adding new columns, or transforming existing data for more in-depth analysis.

For this project, you must:

Rename at least one column.
Add at least one column.

Step 7. Initial Visualizations

Create a variety of chart types using seaborn and matplotlib to showcase different aspects of your data. For each chart, include the goal - what you want to learn/explore, the type of chart you choose, display the chart, and tell your data story. Use Markdown cells and Python cells. Create at least 3 subsections - each subsection should have the following parts:

Goal: The question you are exploring.
Chart Type: Tell us what kind of chart you choose to illustrate this goal.
Chart: Display the chart.
Story: Use Markdown cell(s) to document your observations and insights.

Step 8. Initial Storytelling and Presentation

Present your notebook with an opening that introduces yourself and your topic. Use Markdown section headings to introduce each step. Interpret the visualizations and statistics to narrate a clear and compelling data story. Present your findings in a logical and engaging manner.

Notebook Design

Begin your notebook with a project summary including the title, author, date, and project's purpose. This provides an immediate understanding of the notebook's objective.
Ensure your code and presentation are neat, well-organized, and follow good coding practices. This includes proper variable naming, consistent code style, and logical organization of code cells.
Use Markdown features effectively for formatting, such as section headings, bullet points, and emphasis (bold/italic), to enhance readability.

Notebook Structure and Documentation

Once the notebook runs without errors, focus on how the notebook content is structured and documented. Organize your notebook into well-defined sections, each with a clear purpose and header. Use Markdown cells to provide context, explain your analysis, and share findings. This makes your notebook informative and engaging. Comment your code cells to explain the purpose and functionality of the code. This is especially important for complex or non-obvious code segments.

Notebook Execution

Run your notebook entirely to ensure it executes without errors. This includes checking all code cells and ensuring all data visualizations render as expected. Confirm that your notebook renders correctly on GitHub after pushing, as this ensures your work is viewable by others.

Evaluation Criteria

Functionality: The project should be functional and meet all requirements.
Documentation: The project should be well-written and well-documented.
Presentation: The project should be presented in a clear and organized manner.
Professionalism: The project should be submitted on-time and reflect an original, creative effort.

See rubric for additional information.

Resources

See datafun-04-spec for a guided EDA.
See JUPYTER.md for Jupyter Notebook keyboard shortcuts and recommendations.
See MARKDOWN.md for Markdown syntax and recommendations.
See Plotting graph For IRIS Dataset Using Seaborn And Matplotlib
See Seaborn Tutorial

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

Repository files navigation

Specification for Project 6 EDA Notebook

Overview

Deliverable Names

External Dependencies

Version Control with Git

Objective

Explore Datasets

Requirements

1. Environment Setup

2. Project Start

3. Import Dependencies (At the Top, After the Introduction)

5. Exploratory Data Analysis

Step 1. Data Acquisition

Step 2. Initial Data Inspection

Step 3. Initial Descriptive Statistics

Step 4. Initial Data Distribution for Numerical Columns

Step 5. Initial Data Distribution for Categorical Columns

Step 6. Initial Data Transformation and Feature Engineering

Step 7. Initial Visualizations

Step 8. Initial Storytelling and Presentation

Notebook Design

Notebook Structure and Documentation

Notebook Execution

Evaluation Criteria

Resources

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
README.md		README.md

denisecase/datafun-06-spec

Folders and files

Latest commit

History

.gitignore

.gitignore

README.md

README.md

Repository files navigation

Specification for Project 6 EDA Notebook

Overview

Deliverable Names

External Dependencies

Version Control with Git

Objective

Explore Datasets

Requirements

1. Environment Setup

2. Project Start

3. Import Dependencies (At the Top, After the Introduction)

5. Exploratory Data Analysis

Step 1. Data Acquisition

Step 2. Initial Data Inspection

Step 3. Initial Descriptive Statistics

Step 4. Initial Data Distribution for Numerical Columns

Step 5. Initial Data Distribution for Categorical Columns

Step 6. Initial Data Transformation and Feature Engineering

Step 7. Initial Visualizations

Step 8. Initial Storytelling and Presentation

Notebook Design

Notebook Structure and Documentation

Notebook Execution

Evaluation Criteria

Resources

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages