<h1>Exploratory Analysis</h1>
<p><img src="images/1line.png" width="100%" /></p>
<ul>
<li>Let&rsquo;s load a sample Pandas DataFrame that we use throughout the lecture from this file: <a title="sales_data.csv" href="data/sales_data.csv" target="_blank" rel="noopener" >sales_data.csv</a>.</li>
<li>You can also load this content directly from the internet using this URL: <a href="https://github.com/dgregg/Jupyter/blob/master/Notebooks/data/sales_data.csv" target="_blank" rel="noopener">https://github.com/dgregg/Jupyter/blob/master/Notebooks/data/sales_data.csv</a></li>
<li>Note: This is a different (larger) data set than the one described in the lecture video, so the columns are not identical.</li>
</ul>

In [1]:
import pandas as pd

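# Create a DataFrame from the sales_data csv file located online.
# The plain GitHub "blob" URL returns an HTML page, so ?raw=true is appended to request the CSV file itself.
df = pd.read_csv("https://github.com/dgregg/Jupyter/blob/master/Notebooks/data/sales_data.csv?raw=true")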

<h3 class="wp-block-heading">Standard Deviation</h3>
<ul>
<li><code>.std()</code> is used for calculating the standard deviation.</li>
<li>The <strong>standard deviation</strong> measures how spread out data values are around the mean.
<ul>
<li>A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a much wider range.</li>
</ul>
</li>
<li>By default, the <code>.std()</code> method skips missing values.</li>
</ul>

In [2]:
# Calculating the standard deviation for an entire dataframe
print(df.std(numeric_only=True))

Quantity      1.417413
Sales       719.281639
Age          15.028258
dtype: float64
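
<p>As a minimal sketch of the points above, the toy Series below (hypothetical values, not taken from <code>sales_data.csv</code>) shows how <code>.std()</code> treats a missing value. It also illustrates that Pandas computes the sample standard deviation (<code>ddof=1</code>) by default, while passing <code>ddof=0</code> gives the population version.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not from sales_data.csv)
toy_sales = pd.Series([10.0, 20.0, 30.0, np.nan])

# Default: the missing value is skipped and ddof=1 (sample standard deviation)
sample_std = toy_sales.std()

# Population standard deviation: divide by n instead of n - 1
population_std = toy_sales.std(ddof=0)

# With skipna=False, a single missing value makes the result NaN
std_with_nan = toy_sales.std(skipna=False)

print(sample_std)
print(population_std)
print(std_with_nan)
```

<p>The three non-missing values sit 10 units either side of their mean of 20, which is why the sample standard deviation comes out as a round number here.</p>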


<h3 >Skew</h3>
<ul>
<li >Skewness measures the amount of asymmetry observed in data to the left or right of the mean.</li>
<li >Distributions can exhibit right (positive) skewness or left (negative) skewness to varying degrees.
<ul>
<li >A negative value for skewness indicates that the tail is on the left side of the distribution, which extends towards more negative values.</li>
<li >A positive value for skewness indicates that the tail is on the right side of the distribution, which extends towards more positive values.</li>
<li >A zero value indicates that there is no skewness in the distribution at all, meaning the distribution is symmetric around the mean (a normal distribution is one example of this).</li>
</ul>
</li>
<li>The larger the absolute value of the skewness, the more skewed the data. Values &gt; 1 or &lt; -1 indicate that the data is highly skewed.</li>
<li>We measure the skewness of a dataset using the <code>.skew()</code> method:</li>
</ul>

In [3]:
# Calculating skewness using .skew()
print(df.skew(numeric_only=True))

Quantity   -0.003226
Sales       1.084130
Age        -0.000329
dtype: float64


<ul>
<li>In the example above, the <code>Sales</code> data has a high positive (right) skew, while the <code>Quantity</code> and <code>Age</code> data are almost symmetrical, with a very slight negative (left) skew.</li>
</ul>
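
<p>To make the sign convention concrete, here is a small sketch using three hypothetical toy samples (none of these values come from <code>sales_data.csv</code>). It compares a symmetric sample with one that has a long right tail and with the mirror image of that sample.</p>

```python
import pandas as pd

# Hypothetical toy samples (not from sales_data.csv)
symmetric = pd.Series([1, 2, 3, 4, 5])           # mirror image around the mean
right_tailed = pd.Series([1, 1, 2, 2, 3, 20])    # one large value stretches the right tail
left_tailed = -right_tailed                      # negating the values moves the tail to the left

symmetric_skew = symmetric.skew()
right_skew = right_tailed.skew()
left_skew = left_tailed.skew()

print("symmetric:", symmetric_skew)
print("right-tailed:", right_skew)
print("left-tailed:", left_skew)
```

<p>The symmetric sample has a skewness of zero even though five evenly spaced values are not normally distributed, and the two tailed samples have skewness values of equal size and opposite sign.</p>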
<h3>Describing Summary Statistics</h3>
<ul>
<li>The Pandas <code>describe()</code> method shows a quick statistical summary of the numeric columns in your data.</li>
<li>The method provides a number of helpful statistics, such as the mean, standard deviation and quartiles of the data:</li>
</ul>

In [4]:
# Using the Pandas .describe() method
display(df.describe())

       Quantity        Sales        Age
count   16684.0      16684.0    16684.0
mean   3.007193   857.978725  43.500719
std    1.417413   719.281639  15.028258
min         1.0        40.66       18.0
25%         2.0        203.3       30.0
50%         3.0       600.17       44.0
75%         4.0      1200.34       57.0
max         5.0      3000.85       69.0
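
<p>The summary above covers only numeric columns. As a hedged sketch on a hypothetical toy DataFrame (column names borrowed from the lecture data, values invented), calling <code>describe()</code> on a single text column returns a different set of statistics: the count, the number of unique values, the most frequent value, and its frequency.</p>

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "East"],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# Summary of a text column: count, unique, top, freq
region_summary = toy["Region"].describe()

print(numeric_summary)
print(region_summary)
```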


<h3>Identifying Unique Values</h3>
<ul>
<li>Another common DataFrame operation is identifying the unique values that are in a column.</li>
<li>Because data entry is often imperfect, finding unique values can help determine the data quality of a dataset.</li>

</ul>
<p>We can examine the unique values in the <code>Region</code> column using the <code>.unique()</code> method.</p>

In [11]:
# Getting unique values in the Region column
print(df['Region'].unique())

['South' 'North' 'West' 'East']
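
<p>The <code>Region</code> column above is already clean. To illustrate how unique values can expose data entry problems, the sketch below uses a hypothetical toy column with inconsistent capitalization and a stray trailing space. The cleanup step shown (<code>str.strip()</code> followed by <code>str.title()</code>) is just one possible approach and is an assumption, not part of the lecture data workflow.</p>

```python
import pandas as pd

# Hypothetical toy data with typical data entry inconsistencies
toy = pd.DataFrame({"Region": ["North", "north", "South", "North "]})

# Four distinct strings appear, although only two regions are intended
raw_values = toy["Region"].unique()

# value_counts() shows how often each variant occurs
counts = toy["Region"].value_counts()

# Remove surrounding whitespace and standardize capitalization
cleaned = toy["Region"].str.strip().str.title()
cleaned_values = cleaned.unique()

print(raw_values)
print(counts)
print(cleaned_values)
```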


<h3>Using groupby</h3>
<ul>
<li><span class="ez-toc-section">The <code>groupby()</code> method in Pandas is used to group rows of data together based on the values in one or more columns. </span></li>
<li><span class="ez-toc-section">Once the data is grouped, you can apply various aggregation functions such as <code>sum()</code>, <code>mean()</code>, <code>count()</code>, etc. to calculate summary statistics for each group.</span></li>
<li>The code below groups the data by product category using the <code>groupby()</code> method, calculates the total sales for each product category using the <code>sum()</code> method, and shows the results using <code>print()</code> for the title and <code>display()</code> for the totals.</li>
</ul>

In [10]:
# Calculate total sales for each product category
total_sales = df.groupby('Category')['Sales'].sum()

# Print out the results
print("Total Sales by Category")
display(total_sales)

Total Sales by Category


Category
Clothing     8698418.96
Cosmetics     515853.42
Shoes        5100244.66
Name: Sales, dtype: float64
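
<p>Several aggregation functions can also be applied in one step. The sketch below uses a hypothetical toy DataFrame (invented values, column names borrowed from the lecture data) and passes a list of function names to <code>.agg()</code> to get the total, average, and number of sales per category.</p>

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "Category": ["Shoes", "Shoes", "Clothing", "Clothing", "Clothing"],
    "Sales": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# One row per category, one column per aggregation function
summary = toy.groupby("Category")["Sales"].agg(["sum", "mean", "count"])

print(summary)
```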

<h3 >Cross tabulation</h3>
<ul>
<li>Cross tabulation (crosstab) is an analysis tool used to compare the results for one or more variables with the results of another variable.</li>
<li>A crosstab allows you to examine relationships within the data that might not be obvious when simply looking at the overall data set.</li>
</ul>

In [7]:
# Creating a Region/Category Crosstab
tab = pd.crosstab(
    index=df['Region'],
    columns=df['Category'],
)

display(tab)

Category  Clothing  Cosmetics  Shoes
Region
East          2556       1119    761
North         2601       1169    736
South         1965        827    563
West          2533       1096    758


<ul>
<li >By default, Pandas provides the counts across the categories in the crosstab.</li>
<li >Pandas also allows you to choose many other types of aggregation functions (e.g. sum, median, mode, min, max ...) by passing both a <code>values=</code> parameter and an <code>aggfunc=</code> parameter.</li>
<li>The example below aggregates the sum of the "Sales" column by Region and Category.</li>
</ul>

In [9]:
# Aggregating Sales Using a Pandas Crosstab
tab = pd.crosstab(
    index=df['Region'],
    columns=df['Category'],
    values=df['Sales'],
    aggfunc='sum'
)

display(tab)

Category    Clothing  Cosmetics       Shoes
Region
East      2309415.68   132551.6  1359985.22
North     2335222.56   143936.4  1341379.95
South     1780374.64  102747.82  1037693.93
West      2273406.08   136617.6  1361185.56


<ul>
<li>This type of data analysis allows us to easily see patterns in the data.</li>
<li>For example, we can identify the regions with the highest and lowest sales as well as the best performing product categories.</li>
</ul>
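
<p>Two further <code>pd.crosstab()</code> options can make such patterns easier to read: <code>margins=True</code> adds row and column totals, and <code>normalize="index"</code> converts each row to proportions. The sketch below demonstrates both on a hypothetical toy DataFrame (invented values, column names borrowed from the lecture data).</p>

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Category": ["Shoes", "Clothing", "Shoes", "Shoes"],
})

# Counts with an extra "All" row and column holding the totals
counts = pd.crosstab(index=toy["Region"], columns=toy["Category"], margins=True)

# Share of each category within a region (each row sums to 1)
shares = pd.crosstab(index=toy["Region"], columns=toy["Category"], normalize="index")

print(counts)
print(shares)
```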
<hr />

<h4>References</h4>
<ul>
<li>Pandas DataFrame Documentation: <a href="https://pandas.pydata.org/pandas-docs/stable/reference/frame.html" target="_blank" rel="noopener">https://pandas.pydata.org/pandas-docs/stable/reference/frame.html</a></li>
<li>Summarizing and Analyzing a Pandas DataFrame, datagy.io, 1/5/22, <a href="https://datagy.io/pandas-exploratory-data-analysis/" target="_blank" rel="noopener">https://datagy.io/pandas-exploratory-data-analysis/</a></li>
<li>Pandas DataFrames, W3Schools: <a href="https://www.w3schools.com/python/pandas/pandas_dataframes.asp" target="_blank" rel="noopener">https://www.w3schools.com/python/pandas/pandas_dataframes.asp</a></li>
</ul>