## Lab 03.01 - Introduction to Pandas and Data Operations
In this lab, you'll learn how to use the pandas library to analyze real-world sales data. By the end of this lab, you'll be comfortable with the fundamental operations in pandas.

### Part 0 - Intro to Pandas
Before we dive into using pandas, let's understand what it is and why it's important in data science.

To answer the questions, edit the markdown cell and put your answer below the question. 

**Make sure to save the markdown cell by pressing the ✓ (check) icon in the top right after answering the questions.**

Research and critically think about the following questions:

##### Question 00
What is Pandas, and what are its primary functions in data visualization for data science?
- **Answer:** Pandas is a powerful Python library for data manipulation and analysis. Its primary functions include data cleaning, transformation, and merging, which prepare data for visualization, and it provides data structures optimized for efficient operations on tabular and time series data.

##### Question 01
What does "data manipulation" mean in the context of pandas and the data science field?
- **Answer:**  In the context of pandas and data science, "data manipulation" refers to the process of transforming and reorganizing data to make it more suitable for analysis. This includes tasks like filtering, sorting, aggregating, and reshaping data to extract meaningful insights.

##### Question 02
What kind of files can pandas read? List at least 3:
- **Answer:** Pandas can read various file formats, including CSV, Excel (XLS, XLSX), JSON, SQL databases, and HDF5. It also supports reading from HTML tables and clipboard data.
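
To make the answer above concrete, here is a minimal sketch that reads the same two records from CSV text and from JSON text. The records are made up for illustration, and `io.StringIO` stands in for a file on disk so the cell runs without any downloads.

```python
import io

import pandas as pd

# Made-up records for illustration; the lab's real data lives in sales_data.csv.
csv_text = "Product,Price\nLaptop,999.99\nMouse,19.99\n"
json_text = '[{"Product": "Laptop", "Price": 999.99}, {"Product": "Mouse", "Price": 19.99}]'

# read_csv and read_json accept a file path or a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls return a DataFrame with the same columns, which is why the rest of the lab works the same way regardless of the original file format.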

#### 0.0 - Hands-On Exploration
Let's start by importing pandas and exploring its basic functionality. Type these commands and write down what you observe:

First, let's install the pandas library. In your terminal, run the following command:

```bash
pip3 install pandas
```

Once installed, let's start by importing pandas and exploring its basic functionality. 

In [None]:
import pandas as pd

Now let's create a list and see what happens when we use it with pandas.

In [None]:
# Create a simple list of numbers
numbers = [1, 2, 3, 4, 5]

# Convert it to a pandas Series
series = pd.Series(numbers)

# Print the result
print(series)

##### Question 03
What did you notice about the output format?
- **Answer:** The output format displays a pandas Series with an index on the left side and corresponding values on the right. This structured format allows for easy data access and manipulation.

##### Question 04
How is this different from a regular Python list? What are the numbers on the left side?
- **Answer:** Unlike a regular Python list, the pandas Series has an index associated with each value. The numbers on the left side represent the default integer index, which allows for efficient data retrieval and alignment.
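
The short sketch below illustrates the index described in the answer above. The values and the labels `"a"`, `"b"`, `"c"` are made up for illustration.

```python
import pandas as pd

numbers = [10, 20, 30]
series = pd.Series(numbers)

# The default index is the integer labels 0, 1, 2.
print(list(series.index))

# The index can also hold custom labels, which a plain list cannot do.
labeled = pd.Series(numbers, index=["a", "b", "c"])
print(labeled["b"])

# Arithmetic applies to every element at once; a list would need a loop.
doubled = series * 2
print(doubled.tolist())
```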

### Part 1 - Reading and Exploring Data

#### 01.00 - Loading Data
Let's start by loading our sales data that you downloaded alongside this lab. In pandas, we can read various file formats, but CSV (Comma-Separated Values) is one of the most common.

In [None]:
sales_df = pd.read_csv('sales_data.csv')

When pandas loads data, it ingests it into what is known as a "DataFrame", a two-dimensional table of labeled rows and columns that allows us to do advanced manipulation.

#### 01.01 - Viewing Data Samples
Let's explore the fundamental methods for understanding our data structure and content. We'll look at each method individually and understand what it tells us about our data.

In [None]:
print(sales_df.head())

In [None]:
print(sales_df.tail())

##### Question 04
What is the difference between the `tail()` and `head()` commands? What would the use of either command be?
- **Answer:** The `head()` function displays the first few rows of the DataFrame, while `tail()` shows the last few rows. These commands are useful for quickly inspecting the structure and content of a dataset without printing every row.

##### Question 05
What happens when you put a number in the `head()` function? What changed?
- **Answer:** When you put a number in the `head()` function, it changes the number of rows displayed. For example, head(10) would show the first 10 rows instead of the default 5.
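
As a quick check of the behavior described above, this sketch uses a made-up ten-row DataFrame rather than the lab's sales data.

```python
import pandas as pd

# Ten made-up rows so the effect of the argument is easy to see.
df = pd.DataFrame({"Units": range(1, 11)})

first_default = df.head()   # no argument: the first 5 rows
first_three = df.head(3)    # the first 3 rows
last_two = df.tail(2)       # tail() accepts a row count as well

print(first_default)
print(first_three)
print(last_two)
```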

#### 01.02 - Understanding DataFrame Structure
Let's examine the basic properties of our DataFrame:

In [None]:
print(sales_df.columns)

In [None]:
print(sales_df.dtypes)

##### Question 06
Compare and contrast the outputs of the `columns` and `dtypes` properties. What is similar, and what is different?
- **Answer:** The columns property shows column names, while dtypes displays data types for each column. Both provide structural information, but dtypes gives more detailed insight into data storage.

##### Question 07
What data type is the 'Date' column? Is this what you expected? 
- **Answer:** The 'Date' column is likely stored as an object data type. This might be unexpected, as dates are often stored in a specific datetime format for easier manipulation.
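
If the 'Date' column does load as text, one common follow-up is to convert it with `pd.to_datetime`. The sketch below assumes ISO-style dates (`YYYY-MM-DD`) in a made-up column; the actual format in `sales_data.csv` may differ, in which case the `format` argument would need to change.

```python
import pandas as pd

# Made-up dates stored as text, assumed to be in YYYY-MM-DD form.
df = pd.DataFrame({"Date": ["2024-01-15", "2024-03-02", "2024-02-10"]})
print(df["Date"].dtype)  # a text dtype before conversion

# Convert the text to real datetime values.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
print(df["Date"].dtype)

# Once converted, min() and max() compare dates chronologically.
earliest = df["Date"].min()
latest = df["Date"].max()
print(earliest, latest)
```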

##### Question 08
Why might pandas choose different data types for different columns?
- **Answer:** Pandas chooses different data types for columns to optimize memory usage and enable appropriate operations. It infers types based on the data content, selecting the most suitable type for each column.

#### 01.03 - Data Information Summary
The `info()` method provides a concise summary of our DataFrame:

In [None]:
sales_df.info()  # info() prints the summary itself and returns None, so print() is not needed

##### Question 09
What information does `info()` give us, and how could this information be useful?
- **Answer:** The `info()` method provides a concise summary of the DataFrame, including column names, non-null counts, and data types. This information is useful for quickly understanding the structure and completeness of the dataset.

##### Question 10
How much memory is our DataFrame using? 
- **Answer:** The memory usage of the DataFrame can be found in the output of the info() method. It typically shows the total memory usage in bytes or a human-readable format.

##### Question 11
What does "null" mean? What impact could having "null" values in a dataset have?
- **Answer:** "Null" in a dataset refers to missing or undefined values. Having null values can impact analysis by skewing results or causing errors in calculations, requiring careful handling during data preprocessing.
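
The sketch below shows a few common ways to find and handle null values. The data is made up, and whether to drop or fill missing values depends on the analysis; neither is the single correct choice.

```python
import pandas as pd

# Made-up data with one missing price (None becomes NaN in a numeric column).
df = pd.DataFrame({"Price": [10.0, None, 30.0], "Units": [1, 2, 3]})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Most pandas statistics skip nulls by default, so this averages 10.0 and 30.0.
mean_skipping_nulls = df["Price"].mean()

# Two common ways to handle nulls: drop the affected rows, or fill in a value.
dropped = df.dropna()
filled = df.fillna({"Price": 0.0})
print(dropped)
print(filled)
```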

#### 01.04 - Numerical Summaries
The `describe()` method provides statistical summaries for numerical columns:

In [None]:
print(sales_df.describe())

##### Question 12
What statistics does the `describe()` function give us?
- **Answer:** The describe() function provides statistical summaries including count, mean, standard deviation, minimum, maximum, and quartile values. These statistics offer a quick overview of the distribution and central tendencies of numerical data.

##### Question 13
Which columns did the `describe()` function perform statistics on? Why? (Hint: why didn't we see 'Product' in the output?)
- **Answer:** The describe() function performs statistics on numerical columns by default. It excludes non-numeric columns like 'Product' because statistical measures aren't applicable to categorical or text data.
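
Text columns can still be summarized if you ask for them. The sketch below, on made-up data, compares the default call with `describe(include="all")`, which adds rows such as `unique`, `top`, and `freq` for non-numeric columns.

```python
import pandas as pd

# Made-up data with one text column and one numeric column.
df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Mouse"],
    "Price": [999.99, 19.99, 24.99],
})

# Default behavior: only the numeric column is summarized.
numeric_summary = df.describe()
print(numeric_summary)

# include="all" summarizes every column; statistics that do not apply show NaN.
all_summary = df.describe(include="all")
print(all_summary)
```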

In [None]:
print(sales_df[['Price', 'Units']].describe())

##### Question 14
What is different in the code block above compared to the previous one? How did this affect the output?
- **Answer:** The code block `sales_df[['Price', 'Units']].describe()` specifically selects only the 'Price' and 'Units' columns for description. This affected the output by limiting the statistical summary to just these two columns, instead of all numerical columns in the dataset.

In [1]:
print(sales_df['Category'].unique())

##### Question 15
How many unique product categories are in our data set?
- **Answer:** 6

In [None]:
print("Output 1:")
print(sales_df['Category'].value_counts())

print('')
print('')

print("Output 2:")
print(sales_df['Category'].value_counts(normalize=True))

##### Question 16
What is the difference between the two outputs?
- **Answer:** The difference between the two outputs is that Output 1 shows the raw count of each category, while Output 2 shows the relative frequency of each category.
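
A small sketch of the same two calls on made-up category values, where the numbers are easy to verify by hand:

```python
import pandas as pd

# Made-up categories: Electronics appears twice out of four values.
categories = pd.Series(["Electronics", "Clothing", "Electronics", "Books"])

counts = categories.value_counts()                 # raw counts
shares = categories.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)

# idxmax() returns the label with the largest count.
most_common = counts.idxmax()
print(most_common)
```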


##### Question 17
What is our most common product category?
- **Answer:** Based on the visible data, Electronics appears to be the most common product category.

##### Exercise 00
Using what you've learned, create a code cell directly below this one to answer these questions:

- What is the date range of our sales data? (Hint: try `min()` and `max()` functions)
- How many unique products do we sell?
- What is the average price by category?
- What is our most common payment method?
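
Hint: the "average price by category" question needs `groupby()`, which this lab has not shown yet. The sketch below demonstrates it, along with `nunique()`, on a made-up DataFrame. It is an illustration of the technique, not the solution for `sales_df`.

```python
import pandas as pd

# Made-up data; the category names and prices are placeholders.
toy = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "Product": ["Pen", "Pencil", "Pen"],
    "Price": [10.0, 20.0, 40.0],
})

# Group the rows by category, then average the prices within each group.
avg_price = toy.groupby("Category")["Price"].mean()
print(avg_price)

# nunique() counts the distinct values in a column.
unique_products = toy["Product"].nunique()
print(unique_products)
```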

### Part 2 - Data Selection and Filtering

#### 02.01 - Column Selection
There are multiple ways to select columns in pandas. Let's explore them:

In [None]:
# Select single column
prices = sales_df['Price']

# Select multiple columns
product_info = sales_df[['Product', 'Price', 'Units']]

##### Question 17
Compare and contrast how we select multiple columns vs a single column.
- **Answer:** To select multiple columns, we use double square brackets with a list of column names, while for a single column, we use single square brackets. This syntax difference allows for selecting either a DataFrame subset or a single Series.
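
The difference in return type can be checked directly. This sketch uses a made-up DataFrame with the same column names as the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse"],
    "Price": [999.99, 19.99],
    "Units": [1, 3],
})

single = df["Price"]               # single brackets: a Series
one_col_frame = df[["Price"]]      # a one-item list: still a DataFrame
multi = df[["Product", "Units"]]   # a list of names: a DataFrame

print(type(single).__name__)
print(type(one_col_frame).__name__)
print(list(multi.columns))
```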

##### Exercise 01
Using what you've learned, create a code cell directly below this one to select the 'Region' and 'Category' columns and store them in a variable called 'location_data'.

#### 02.02 - Filtering Data
Let's learn how to filter our data based on conditions:

In [None]:
expensive_items = sales_df[sales_df['Price'] > 500]

north_sales = sales_df[sales_df['Region'] == 'North']

print("Expensive items:")
print(expensive_items)
print("\nNorth region sales:")
print(north_sales)

##### Question 18
What is `sales_df['Price'] > 500` accomplishing in our code? How might we use this in data science workflows?
- **Answer:** The code `sales_df['Price'] > 500` compares every price to 500 and produces a boolean mask: a Series of `True`/`False` values, one per row. Passing that mask back into `sales_df[...]` keeps only the rows where the price is greater than 500. This type of filtering is crucial in data science workflows for segmenting data, identifying high-value transactions, or focusing analysis on specific subsets of the data.
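
To see the mask on its own before it is used as a filter, here is a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Monitor"],
    "Price": [999.99, 19.99, 549.0],
})

# The comparison alone returns one True/False value per row.
mask = df["Price"] > 500
print(mask)

# Indexing with the mask keeps only the rows marked True.
expensive = df[mask]
print(expensive)

# Summing a mask counts the True values.
n_expensive = int(mask.sum())
print(n_expensive)
```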

##### Question 19
What logic operator would we use to combine multiple conditionals when filtering columns?
- **Answer:** To combine multiple conditionals when filtering columns, we would use the element-wise logical operators `&` (and) or `|` (or), wrapping each condition in parentheses. For example: `sales_df[(sales_df['Price'] > 500) & (sales_df['Region'] == 'North')]`
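
The sketch below tries the combining operators on made-up data. Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparisons such as `>` and `==` in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [600, 700, 50, 800],
    "Region": ["North", "South", "North", "North"],
})

# & keeps rows where both conditions are True.
both = df[(df["Price"] > 500) & (df["Region"] == "North")]

# | keeps rows where at least one condition is True.
either = df[(df["Price"] > 500) | (df["Region"] == "North")]

# ~ inverts a condition.
not_north = df[~(df["Region"] == "North")]

print(both)
print(either)
print(not_north)
```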

##### Exercise 02
Using what you've learned, create a code cell directly below this one to create the following filters:
- The number of units sold was greater than 5
- The product category is 'Electronics'

### Part 3 - Final Summary
Use your experience completing this lab and the code block below to answer the following summary questions.

##### Question 20
In your own words, explain what pandas is and why it's useful for data analysis:
- **Answer:** Pandas is a powerful Python library for data manipulation and analysis that provides data structures like DataFrames for efficiently handling structured data. It's useful for data analysis because it offers a wide range of functions for data cleaning, transformation, merging, and statistical operations, making it easier to prepare and analyze complex datasets.

##### Question 21
How many unique products are in the dataset?
- **Answer:** The exact number of unique products is not directly visible in the provided data excerpt. To determine this, we would need to run `sales_df['Product'].nunique()` on the full dataset.

##### Question 22
What was the most popular product category in the `North` region? 
- **Answer:** This information is not directly available from the given data excerpt.
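
One way to compute this is to filter by region and then count categories, combining techniques from Parts 1 and 2. The sketch below runs on made-up rows; the same two steps should apply to `sales_df`, assuming its 'Region' and 'Category' columns hold values like these.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "North"],
    "Category": ["Electronics", "Clothing", "Clothing", "Electronics"],
})

# Step 1: keep only the North rows.
north = toy[toy["Region"] == "North"]

# Step 2: count the categories and take the label with the highest count.
top_category = north["Category"].value_counts().idxmax()
print(top_category)
```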