# Dataset Report

This report uses pandas to gain insights from a dataset stored in a whitespace-delimited text file, `balance.txt`. Part 1 contains the code and instructions provided by HyperionDev; Part 2 contains my own code, written to follow those instructions.

## Part 1: Code and instructions pre-provided

In [41]:
# Import pandas
import pandas as pd

In [42]:
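# Create a DataFrame from balance.txt, whose columns are separated by whitespace.
# sep=r'\s+' is equivalent to delim_whitespace=True, which is deprecated since pandas 2.2.
df = pd.read_csv('balance.txt', sep=r'\s+')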

Write the code needed to produce a report that provides the following information:
* Compare the average income based on ethnicity.
* On average, do married or single people have a higher balance?
* What is the highest income in our dataset?
* What is the lowest income in our dataset?
* How many cards do we have recorded in our dataset? (Hint: use `sum()`)
* How many females do we have information for vs how many males? (Hint: use `count()`. For a list of all methods for computing descriptive stats, explore the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats).)




## Part 2: My code

Below is code which produces the report requested.

### 1. Compare the average income based on ethnicity.

In [43]:
# Dataframe of average income based on ethnicity.
df[['Ethnicity','Income']].groupby(['Ethnicity']).mean(numeric_only=True)

| Ethnicity | Income |
|---|---|
| African American | 47.682101 |
| Asian | 44.187833 |
| Caucasian | 44.521945 |


In [44]:
# Selects the rows for each ethnicity, then calculates the average income of those rows.
income_str = "The average income of"
print(income_str, "an African American person is", df[df.Ethnicity =="African American"].loc[:,"Income"].mean())
print(income_str, "an Asian person is", df[df.Ethnicity =="Asian"].loc[:,"Income"].mean())
print(income_str, "a Caucasian person is", df[df.Ethnicity =="Caucasian"].loc[:,"Income"].mean())

The average income of an African American person is 47.68210101010099
The average income of an Asian person is 44.18783333333334
The average income of a Caucasian person is 44.521944723618084
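The three `print` calls above repeat the same filter once per category. As a minimal sketch of an alternative, a single `groupby` can compute every group mean, and a loop can print one sentence per group. The `sample` DataFrame and its values are made up for illustration and stand in for `balance.txt`.

```python
import pandas as pd

# Hypothetical sample rows standing in for balance.txt (values are made up).
sample = pd.DataFrame({
    "Ethnicity": ["Asian", "Asian", "Caucasian", "African American"],
    "Income": [40.0, 50.0, 30.0, 60.0],
})

# One groupby call computes the mean income for every ethnicity at once.
mean_income = sample.groupby("Ethnicity")["Income"].mean()

# Loop over the resulting Series instead of writing one print per category.
for ethnicity, income in mean_income.items():
    print(f"The average income for the {ethnicity} group is {income:.2f}")
```

This version keeps working if the data gains another category, because no category name is typed by hand.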


### 2. On average, do married or single people have a higher balance?

In [53]:
# Dataframe showing average balance of single versus married people.
r_df = df[['Married','Balance']].groupby(['Married']).mean(numeric_only=True)
r_df

| Married | Balance |
|---|---|
| No | 13.493509 |
| Yes | 13.388473 |


In [59]:
# Looks up the average balance in the row where the 'Married' value is 'No'.
avg_bal_single = r_df.loc["No", "Balance"]

# Looks up the average balance in the row where the 'Married' value is 'Yes'.
avg_bal_married = r_df.loc["Yes", "Balance"]

# Conditional that prints a string based on which average balance is higher.
if avg_bal_single > avg_bal_married:
    print("On average, single people have a higher balance.")
elif avg_bal_single < avg_bal_married:
    print("On average, married people have a higher balance.")
else:
    print("On average, single and married people have the same balance.")

On average, single people have a higher balance.
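The comparison above can also be sketched without an explicit conditional: `idxmax()` returns the index label of the largest value in a Series, which here is the marital status with the higher mean balance. The `sample` data and the `labels` mapping are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample rows standing in for balance.txt (values are made up).
sample = pd.DataFrame({
    "Married": ["Yes", "No", "Yes", "No"],
    "Balance": [10.0, 14.0, 12.0, 16.0],
})

# Mean balance per marital status, indexed by 'No' and 'Yes'.
mean_balance = sample.groupby("Married")["Balance"].mean()

# idxmax() returns the index label that holds the largest mean.
top_group = mean_balance.idxmax()

# Hypothetical mapping from the 'Married' values to readable wording.
labels = {"No": "single", "Yes": "married"}
print(f"On average, {labels[top_group]} people have a higher balance.")
```

One limitation of this sketch: if the two means are exactly equal, `idxmax()` returns the first label, so a tie would need a separate check.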


### 3. What is the highest income in our dataset?

In [46]:
# Finds the maximum value in the 'Income' column, then prints it.
print("The highest income in our dataset is", df.loc[:,"Income"].max())

The highest income in our dataset is 186.634


### 4. What is the lowest income in our dataset?

In [47]:
# Finds the minimum value in the 'Income' column, then prints it.
print("The lowest income in our dataset is", df.loc[:,"Income"].min())

The lowest income in our dataset is 10.354
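Sections 3 and 4 each query the `Income` column separately. As a hedged sketch, both extremes can be computed in one call by passing a list of function names to `Series.agg`, which returns a Series indexed by those names. The `sample` values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample incomes standing in for balance.txt (values are made up).
sample = pd.DataFrame({"Income": [25.5, 80.0, 12.25, 47.0]})

# agg with a list of function names returns a Series indexed by 'min' and 'max'.
income_range = sample["Income"].agg(["min", "max"])

print("The lowest income in the sample is", income_range["min"])
print("The highest income in the sample is", income_range["max"])
```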


### 5. How many cards do we have recorded in our dataset?

In [48]:
# Sums the values in the 'Cards' column across all rows, then prints the total.
print("There are", df.loc[:,"Cards"].sum(), "cards in our dataset.")

There are 1183 cards in our dataset.
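The instructions hint at both `sum()` and `count()`, and the two answer different questions. The sketch below, using made-up values, shows the difference: `sum()` adds up the values in a column, while `count()` reports how many non-missing entries the column has.

```python
import pandas as pd

# Hypothetical sample rows: each row is one person, 'Cards' is how many cards they hold.
sample = pd.DataFrame({"Cards": [2, 3, 1, 4]})

# sum() adds the values together: the total number of cards held.
total_cards = sample["Cards"].sum()

# count() counts the non-missing entries: the number of people recorded.
people_recorded = sample["Cards"].count()

print("Total cards:", total_cards)
print("People recorded:", people_recorded)
```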


### 6. How many females do we have information for vs how many males?

In [49]:
# Print header string.
print("Information held in the dataset by gender" + "\n"+("-"*45))

# Counts the number of rows with 'Female' in the 'Gender' column, then prints this value.
print("Female:", df[df.Gender == "Female"]["Gender"].count())

# Counts the number of rows with 'Male' in the 'Gender' column, then prints this value.
print("Male:", df[df.Gender == "Male"]["Gender"].count())


Information held in the dataset by gender
---------------------------------------------
Female: 207
Male: 193
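A possible shortcut for the counts above is `value_counts()`, which tallies every distinct value in a column in one call, so each gender does not need its own filter. The `sample` data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows standing in for balance.txt (values are made up).
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
})

# value_counts() returns a Series mapping each distinct value to its row count.
gender_counts = sample["Gender"].value_counts()

for gender, count in gender_counts.items():
    print(f"{gender}: {count}")
```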
