<ul>
<li>Let&rsquo;s load a sample Pandas DataFrame that we use throughout the lecture from this file: <a title="sales_data.csv" href="data/sales_data.csv" target="_blank" rel="noopener" >sales_data.csv</a>.</li>
<li>You can also load this content directly from the internet using this URL: <a href="https://github.com/dgregg/Jupyter/blob/master/Notebooks/data/sales_data.csv" target="_blank" rel="noopener">https://github.com/dgregg/Jupyter/blob/master/Notebooks/sales_data.csv</a></li>
<li>Note: This is a different (larger) data set than the one described in the lecture video, so the columns are not identical.</li>
</ul>

In [5]:
import pandas as pd

# Create a DataFrame from the sales_data csv file in the data directory
df = pd.read_csv("data/sales_data.csv")

display(df)

Unnamed: 0,Date,Category,Quantity,Sales,Type,Region,Gender,Age
0,1/1/2022,Clothing,5,1500.40,Cash,South,Male,65
1,1/1/2022,Cosmetics,5,203.30,Cash,South,Male,55
2,1/1/2022,Clothing,2,600.16,Debit Card,North,Female,35
3,1/1/2022,Clothing,1,300.08,Credit Card,North,Female,18
4,1/1/2022,Clothing,1,300.08,Debit Card,West,Female,63
...,...,...,...,...,...,...,...,...
16679,12/31/2022,Clothing,1,300.08,Credit Card,South,Female,34
16680,12/31/2022,Cosmetics,3,121.98,Credit Card,East,Male,69
16681,12/31/2022,Clothing,2,600.16,Credit Card,North,Female,68
16682,12/31/2022,Shoes,1,600.17,Debit Card,North,Male,36
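
<p>The <code>Unnamed: 0</code> column in the output above typically appears when a CSV file was saved with its row index and no header for that column. The sketch below is a minimal illustration, assuming a small made-up sample (not the real file) in the same layout; it shows how <code>index_col=0</code> tells <code>read_csv</code> to use that first column as the row index instead.</p>

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like sales_data.csv (not the real file)
csv_text = """,Date,Category,Quantity,Sales
0,1/1/2022,Clothing,5,1500.40
1,1/1/2022,Cosmetics,5,203.30
2,1/1/2022,Clothing,2,600.16
"""

# Default read: the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 uses the first column as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

<p>Whether you want this depends on the file; the rest of this page keeps the default read.</p>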


<p>The dataframe has eight columns:</p>
<ol>
<li>A <code>Date</code> column that holds the transaction date</li>
<li><code>Region</code>, <code>Category</code>, <code>Type</code>, and <code>Gender</code> columns that contain categorical/text variables</li>
<li>A <code>Quantity</code>  column that holds the number of units sold</li>
<li>A <code>Sales</code> column that holds the total amount of a sale</li>
<li>An <code>Age</code> column that holds the age of the shopper</li>
</ol>
<p>We can also see that the DataFrame has 16684 rows, even though only the top five and bottom five rows are displayed.</p>
<h3>Accessing cells, rows and columns</h3>
<h4>Columns</h4>
<ul>
<li>To access a specific column, use index notation with square brackets, as you would with lists and dictionaries (e.g. <code>df["Sales"]</code>).</li>
<li>You can also use dot notation, in which case you do not need quotation marks (e.g. <code>df.Quantity</code>).</li>
</ul>

In [4]:
#Get one column from data frame
print(df["Sales"])
print(df.Quantity)

0      242.40
1       15.15
2       15.15
3      378.75
4      378.75
        ...  
995    573.44
996    322.56
997     35.84
998     35.84
999     36.84
Name: Sales, Length: 1000, dtype: float64
0      4
1      1
2      1
3      5
4      5
      ..
995    4
996    3
997    1
998    1
999    1
Name: Units, Length: 1000, dtype: int64
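
<p>As a minimal sketch of the two notations, the example below builds a small made-up DataFrame (the rows are hypothetical; only the column names follow <code>sales_data.csv</code>) and checks that bracket and dot notation return the same Series. It also shows that passing a list of names inside the brackets returns a DataFrame with just those columns.</p>

```python
import pandas as pd

# Hypothetical sample rows; column names follow sales_data.csv
sample = pd.DataFrame({
    "Category": ["Clothing", "Cosmetics", "Clothing"],
    "Quantity": [5, 5, 2],
    "Sales": [1500.40, 203.30, 600.16],
})

by_bracket = sample["Quantity"]  # bracket notation
by_dot = sample.Quantity         # dot notation

# Both notations return the same pandas Series
print(by_bracket.equals(by_dot))
print(type(by_bracket).__name__)

# A list of column names inside the brackets returns a DataFrame
subset = sample[["Category", "Sales"]]
print(subset.shape)
```

<p>Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket notation is the safer general choice.</p>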


<h4>Cells</h4>
<ul>
<li>To access an individual cell, use the column name followed by the row label (e.g. <code>df["Sales"][995]</code>).</li>
</ul>

In [5]:
# Get the value in one cell
print("Sales Row 995:", df["Sales"][995])

Sales Row 995: 573.44
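
<p>The column-then-row lookup above can also be written as a single step. The sketch below, which assumes a small made-up DataFrame, compares it with the <code>.loc[row, column]</code> and <code>.at[row, column]</code> accessors; all three return the same value here.</p>

```python
import pandas as pd

# Hypothetical sample rows for illustration
sample = pd.DataFrame({
    "Quantity": [5, 5, 2],
    "Sales": [1500.40, 203.30, 600.16],
})

chained = sample["Sales"][2]      # column first, then row label
via_loc = sample.loc[2, "Sales"]  # row label and column name in one step
via_at = sample.at[2, "Sales"]    # scalar-only accessor

print(chained, via_loc, via_at)
```

<p>The single-step forms are generally preferred when assigning a new value to a cell, because chained indexing may operate on a temporary copy.</p>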


<h4>Rows</h4>
<ul>
<li>You can access an entire row from a DataFrame using the <code>.loc[]</code> indexer, which takes the row label in square brackets.</li>
</ul>

In [6]:
# Get one row from the DataFrame
print("Row 995")
print(df.loc[995])

Row 995
Date         12/23/2022
Region             East
Type       Books & Toys
Units                 4
Price            143.36
Sales            573.44
Payment            Cash
Gender           Female
Age                  49
Name: 995, dtype: object
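
<p>The row returned by <code>.loc[]</code> is itself a Series whose index holds the column names. The sketch below assumes a small made-up DataFrame with non-default row labels (10, 20, 30) to show that <code>.loc[]</code> looks rows up by label, while the related <code>.iloc[]</code> indexer looks them up by position.</p>

```python
import pandas as pd

# Hypothetical sample with non-default row labels
sample = pd.DataFrame(
    {
        "Category": ["Clothing", "Cosmetics", "Shoes"],
        "Sales": [1500.40, 203.30, 600.17],
    },
    index=[10, 20, 30],
)

row_by_label = sample.loc[20]     # the row whose label is 20
row_by_position = sample.iloc[1]  # the second row, whatever its label

# The row is a Series indexed by column name
print(list(row_by_label.index))
print(row_by_label["Category"])
print(row_by_label.equals(row_by_position))
```

<p>With the default 0, 1, 2, ... index used elsewhere on this page, labels and positions coincide, so the distinction only matters once the index has been changed or the rows have been filtered.</p>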


<h3 class="wp-block-heading">Finding the Mean</h3>
<ul>
<li>Use the <code>.mean()</code> method to calculate the mean of a Pandas DataFrame.</li>
<li>DataFrame methods (like <code>.mean()</code>) can be applied to a single column (representing a Series) or to multiple columns</li>
</ul>

In [3]:
# Calculate the mean for the Sales column
avg_sales = df['Sales'].mean()
print(avg_sales)

857.978724526611


<ul>
<li>Applying the <code>.mean()</code> method to a single column returns a single numeric value.</li>
<li>This means we can easily assign it to a variable (<code>avg_sales</code>) so it can be reused throughout your program.</li>
<li>You can also apply the <code>.mean()</code> method to an entire DataFrame.</li>
<li>When applying a data analysis method to an entire DataFrame, you need to specify that only numeric columns should be analyzed by setting the <code>numeric_only=True</code> parameter.</li>
</ul>

In [4]:
# Calculate the average for all numeric columns in a dataframe
avg_df = df.mean(numeric_only=True)
print(avg_df)

print("Average of Sales:", avg_df["Sales"])

Quantity      3.007193
Sales       857.978725
Age          43.500719
dtype: float64
Average of Sales: 857.978724526611


<ul>
<li>This returns a pandas Series &ndash; which allows you to access individual values using the index.</li>
</ul>
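<p>As a small self-contained sketch of the points above, the example below uses a made-up four-row DataFrame (hypothetical values, chosen so the averages are easy to verify by hand). It shows that the result of <code>.mean()</code> on a DataFrame is a Series indexed by column name, and that selecting the numeric columns first gives the same result.</p>

```python
import pandas as pd

# Hypothetical sample; values chosen so the means are easy to check by hand
sample = pd.DataFrame({
    "Category": ["Clothing", "Cosmetics", "Clothing", "Shoes"],
    "Quantity": [1, 2, 3, 2],
    "Sales": [100.0, 50.0, 300.0, 150.0],
})

# Mean of every numeric column; the text column is skipped
avg = sample.mean(numeric_only=True)
print(type(avg).__name__)
print(list(avg.index))
print(avg["Sales"])

# Alternative: select the columns of interest first, then take the mean
avg_selected = sample[["Quantity", "Sales"]].mean()
print(avg_selected.equals(avg))
```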
<h3 >Reading Documentation</h3>
<ul>
<li>Knowing how to read documentation is one important way to improve your programming skills (especially if you want to explore newer functionality that ChatGPT does not yet know about).</li>
<li>The documentation for the <code>.mean()</code> method can be found at <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html" target="_blank" rel="noopener">pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html</a>.</li>
</ul>
<figure>
<figcaption><img src="images/Documentation.PNG" alt="The documentation for the Pandas .mean() method"   /></figcaption>
</figure>
<p>There are four main sections to the pandas documentation:</p>
<ol>
<li><strong>Method Name</strong>: we can see here, for example, that we&rsquo;re looking at the DataFrame method (rather than the Series method)</li>
<li><strong>Description</strong>: this provides a plain English description of what the method does</li>
<li><strong>Inputs Allowed</strong>: the different parameters the method accepts (in Pandas, not all parameters are required). It also describes what the parameters do and what the default values are (if a parameter is not provided). For example, Pandas will, by default, skip over missing data.</li>
<li><strong>Output Produced</strong>: what the method returns (i.e., what to expect)</li>
</ol>
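<p>To see how a documented default plays out in practice, the sketch below looks at the <code>skipna</code> parameter of <code>.mean()</code>, which defaults to <code>True</code>. It assumes a made-up one-column DataFrame with a single missing value and compares the default behavior with <code>skipna=False</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing Sales value
sample = pd.DataFrame({"Sales": [100.0, np.nan, 300.0]})

# Default (skipna=True): the missing value is ignored
default_mean = sample.mean()["Sales"]

# skipna=False: the missing value propagates, so the result is NaN
strict_mean = sample.mean(skipna=False)["Sales"]

print(default_mean)
print(strict_mean)
```

<p>Here the default averages only the two available values, while <code>skipna=False</code> makes the gap in the data visible in the result.</p>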
<p>There are many other computational and descriptive statistics methods available in Pandas. You can find most of them in the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats" target="_blank" rel="noopener">DataFrame Documentation</a>.</p>
<hr />

<h4>References</h4>
<ul>
<li>Pandas DataFrame Documentation: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/frame.html" target="_blank" rel="noopener">https://pandas.pydata.org/pandas-docs/stable/reference/frame.html</a></li>
<li>Summarizing and Analyzing a Pandas DataFrame, datagy.io, 1/5/22, <a href="https://datagy.io/pandas-exploratory-data-analysis/" target="_blank" rel="noopener">https://datagy.io/pandas-exploratory-data-analysis/</a></li>
<li>Pandas DataFrames, W3 Schools: <a href="https://www.w3schools.com/python/pandas/pandas_dataframes.asp" target="_blank" rel="noopener">https://www.w3schools.com/python/pandas/pandas_dataframes.asp</a></li>
</ul>