# Spreadsheets
- Also known as Excel files
- Data stored in tabular form, with cells arranged in rows and columns
- Unlike flat files, can have formatting and formulas
- Multiple spreadsheets can exist in a single workbook file

**Loading Spreadsheets**
- Spreadsheets have their own loading function in <code>pandas</code>: <code>read_excel()</code>

In [1]:
import pandas as pd

# Read the Excel file
survey_data = pd.read_excel('datasets/fcc-new-coder-survey.xlsx')
survey_data.head(3)

Unnamed: 0,"FreeCodeCamp New Developer Survey Responses, 2016",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97
0,Source: https://www.kaggle.com/freecodecamp/20...,,,,,,,,,,...,,,,,,,,,,
1,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,BootcampRecommend,ChildrenNumber,CityPopulation,CodeEventConferences,CodeEventDjangoGirls,...,ResourcePluralSight,ResourceSkillCrush,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,SchoolDegree,SchoolMajor,StudentDebtOwe
2,28,0,,,,,,"between 100,000 and 1 million",,,...,,,,,,,,"some college credit, no degree",,20000


**Loading Select Columns and Rows**
- <code>read_excel()</code> has many keyword arguments in common with <code>read_csv()</code>
    - <code>nrows</code>: limit number of rows to load
    - <code>skiprows</code>: specify number of rows or row numbers to skip
    - <code>usecols</code>: choose columns by name, positional number, or letter (e.g. "A:P")

In [15]:
survey_data = pd.read_excel('datasets/fcc-new-coder-survey.xlsx', 
                            skiprows = 2,
                            usecols="A:E, H")
survey_data.head(5)

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,CityPopulation
0,28.0,0.0,,,,"between 100,000 and 1 million"
1,22.0,0.0,,,,"between 100,000 and 1 million"
2,19.0,0.0,,,,more than 1 million
3,26.0,0.0,,,,more than 1 million
4,20.0,0.0,,,,"between 100,000 and 1 million"
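
As a rough, self-contained sketch of these keyword arguments, the snippet below writes a tiny throwaway workbook and reads it back with <code>skiprows</code>, <code>nrows</code>, and both the letter and name forms of <code>usecols</code>. The file name <code>demo.xlsx</code>, its contents, and the variable names are made up for illustration, and writing <code>.xlsx</code> files assumes the <code>openpyxl</code> package is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the survey layout: two title rows, then a header row
rows = [
    ["Demo survey responses", None, None, None],
    ["Source: made-up data", None, None, None],
    ["Age", "AttendedBootcamp", "ChildrenNumber", "CityPopulation"],
    [28, 0, 1, "more than 1 million"],
    [22, 0, 0, "less than 100,000"],
    [19, 1, 0, "more than 1 million"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.xlsx")
    # Write the raw rows without adding a header or index (assumes openpyxl)
    pd.DataFrame(rows).to_excel(path, header=False, index=False)

    # Skip the two title rows, keep Excel columns A, B and D, load two data rows
    by_letter = pd.read_excel(path, skiprows=2, usecols="A:B, D", nrows=2)

    # The same columns selected by name instead of by letter
    by_name = pd.read_excel(path, skiprows=2, nrows=2,
                            usecols=["Age", "AttendedBootcamp", "CityPopulation"])

print(by_letter)
print(by_letter.equals(by_name))
```

Selecting by name tends to survive column reordering in the workbook, while letter ranges are convenient when the header row is unreliable.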


**Selecting Sheets to Load**
- <code>read_excel()</code> loads the first sheet in an Excel file by default
- Use the <code>sheet_name</code> keyword argument to load other sheets
- Specify spreadsheets by name and/or (zero-indexed) position number
- Pass a list of names/numbers to load more than one sheet at a time.
- Any arguments passed to <code>read_excel()</code> apply to all sheets read.

In [8]:
# Get the second sheet by position index

survey_data_sheet2 = pd.read_excel('datasets/fcc-new-coder-survey.xlsx',
                                   skiprows=2,
                                   sheet_name=1)

# Get the second sheet by name
survey_data_2017 = pd.read_excel('datasets/fcc-new-coder-survey.xlsx',
                                skiprows=2,
                                sheet_name='2017')

print(survey_data_sheet2.equals(survey_data_2017))

True
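
A minimal sketch of loading several sheets at once, assuming a made-up two-sheet workbook (<code>demo.xlsx</code>) and that <code>openpyxl</code> is installed for writing it. It also uses <code>pd.ExcelFile</code> to list sheet names first, which can help when the names are not known in advance.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for two yearly survey sheets
sheet_2016 = pd.DataFrame({"Age": [28, 22], "AttendedBootcamp": [0, 0]})
sheet_2017 = pd.DataFrame({"Age": [27, 34], "AttendedBootcamp": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.xlsx")
    # Write both frames into one workbook, one sheet each (assumes openpyxl)
    with pd.ExcelWriter(path) as writer:
        sheet_2016.to_excel(writer, sheet_name="2016", index=False)
        sheet_2017.to_excel(writer, sheet_name="2017", index=False)

    # List the sheet names before deciding what to load
    with pd.ExcelFile(path) as workbook:
        names = workbook.sheet_names

    # Mix a position and a name; the result is a dict keyed by what was passed
    sheets = pd.read_excel(path, sheet_name=[0, "2017"])

print(names)
print(list(sheets.keys()))
```

Note that the dictionary keys mirror the values passed to <code>sheet_name</code>, so the first sheet is stored under <code>0</code> here rather than under its name.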


**Loading All Sheets**
- Passing <code>sheet_name=None</code> to <code>read_excel()</code> reads all sheets in a workbook


In [13]:
survey_responses = pd.read_excel('datasets/fcc-new-coder-survey.xlsx', 
                                 sheet_name=None,
                                 skiprows=2)

print(type(survey_responses))

<class 'dict'>


In [14]:
for key, value in survey_responses.items():
    print(key, type(value))

2016 <class 'pandas.core.frame.DataFrame'>
2017 <class 'pandas.core.frame.DataFrame'>


In [17]:
survey_responses['2016'].head(3)

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,BootcampRecommend,ChildrenNumber,CityPopulation,CodeEventConferences,CodeEventDjangoGirls,...,ResourcePluralSight,ResourceSkillCrush,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,SchoolDegree,SchoolMajor,StudentDebtOwe
0,28.0,0.0,,,,,,"between 100,000 and 1 million",,,...,,,,,,,,"some college credit, no degree",,20000.0
1,22.0,0.0,,,,,,"between 100,000 and 1 million",,,...,,,,,,1.0,,"some college credit, no degree",,
2,19.0,0.0,,,,,,more than 1 million,,,...,,,,,,,,high school diploma or equivalent (GED),,


In [18]:
survey_responses['2017'].head(3)

Unnamed: 0,Age,AttendedBootcamp,BootcampFinish,BootcampLoanYesNo,BootcampName,BootcampRecommend,ChildrenNumber,CityPopulation,CodeEventConferences,CodeEventDjangoGirls,...,ResourcePluralSight,ResourceSkillCrush,ResourceStackOverflow,ResourceTreehouse,ResourceUdacity,ResourceUdemy,ResourceW3Schools,SchoolDegree,SchoolMajor,StudentDebtOwe
0,27.0,0.0,,,,,,more than 1 million,,,...,,,,,,1.0,1.0,"some college credit, no degree",,
1,34.0,0.0,,,,,,"less than 100,000",,,...,,,1.0,,,1.0,1.0,"some college credit, no degree",,
2,21.0,0.0,,,,,,more than 1 million,,,...,,,,,1.0,1.0,,high school diploma or equivalent (GED),,


**Putting It All Together**

In [24]:
# Collect each sheet's dataframe in a list
frames = []

# Iterate through dataframes in dictionary
for sheet_name, frame in survey_responses.items():
    # Add a column so we know which year data is from
    frame['Year'] = sheet_name
    # Queue each dataframe for concatenation
    frames.append(frame)

# DataFrame.append() was removed in pandas 2.0; combine with pd.concat() instead
all_responses = pd.concat(frames)
    
# View years in data
print(all_responses.Year.unique())

['2016' '2017']
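
One possible alternative to building the combined dataframe in a loop is to hand the whole dictionary to <code>pd.concat()</code>, which uses the dictionary keys as an outer index level. The sketch below uses a small made-up dictionary in place of <code>survey_responses</code>; the names <code>sheets</code>, <code>stacked</code>, and <code>all_rows</code> are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the dictionary returned by sheet_name=None
sheets = {
    "2016": pd.DataFrame({"Age": [28, 22]}),
    "2017": pd.DataFrame({"Age": [27, 34]}),
}

# Passing the dict labels each frame's rows with its key in an outer index level
stacked = pd.concat(sheets, names=["Year"])

# Move that level into an ordinary column, keeping the original row labels
all_rows = stacked.reset_index(level="Year")

print(all_rows["Year"].unique())
```

This leaves the original dataframes unmodified, whereas assigning a <code>Year</code> column inside a loop changes each frame in place.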




**pandas and Booleans**
- <code>pandas</code> loads <code>True</code>/<code>False</code> columns as float data by default
- Specify a column should be <code>bool</code> with <code>read_excel()</code>'s <code>dtype</code> argument
- Boolean columns can only have <code>True</code> and <code>False</code> values
- **NA/missing values in Boolean columns are changed to <code>True</code>**
- <code>pandas</code> automatically recognizes some values as <code>True</code>/<code>False</code> in Boolean columns.
- Unrecognized values in a Boolean column are also changed to <code>True</code>

In [39]:
# Load data, casting True/False columns as Boolean

bool_data = pd.read_excel('datasets/fcc-new-coder-survey.xlsx',
                          skiprows=2,
                          dtype={'AttendedBootCampYesNo':bool,
                               'AttendedBootcampTF':bool,
                               'BootcampLoan':bool,
                               'LoanYesNo':bool,
                               'LoanTF':bool})
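
The coercion described above can be observed on in-memory data with <code>astype(bool)</code>. This is an analogy under stated assumptions, not a trace of what <code>read_excel()</code> does internally, and the exact loading behavior may vary between <code>pandas</code> versions. The values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up 1/0 column with one missing answer, stored as float
attended = pd.Series([1.0, 0.0, np.nan])

# Count missing values before casting, since the cast hides them
n_missing = attended.isna().sum()

# NaN is "truthy", so the missing entry becomes True
as_bool = attended.astype(bool)

# Strings are not interpreted: any non-empty string, even "No", becomes True
answers = pd.Series(["Yes", "No"], dtype=object).astype(bool)

print(n_missing, as_bool.tolist(), answers.tolist())
```

Checking <code>isna().sum()</code> before casting is one way to decide whether a Boolean dtype is safe for a given column.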

**Setting Custom True/False Values**

- Use <code>read_excel()</code>'s <code>true_values</code> argument to set custom <code>True</code> values.
- Use <code>false_values</code> to set custom <code>False</code> values
- Each takes a list of values to treat as <code>True</code>/<code>False</code>, respectively
- Custom <code>True</code>/<code>False</code> values are only applied to columns set as Boolean

In [42]:
# Load data with Boolean dtypes and custom T/F values
bool_data = pd.read_excel('datasets/fcc-new-coder-survey.xlsx',
                          skiprows=2,
                          dtype={'AttendedBootCamp':bool,
                               'AttendedBootCampYesNo':bool,
                               'AttendedBootcampTF':bool,
                               'BootcampLoan':bool,
                               'LoanYesNo':bool,
                               'LoanTF':bool},
                          true_values=['Yes'],
                          false_values=['No'])

In [44]:
print(bool_data.sum())

Age                      29524.0
AttendedBootcamp            37.0
BootcampFinish              21.0
BootcampLoanYesNo           14.0
BootcampRecommend           26.0
                          ...   
ResourceStackOverflow        9.0
ResourceTreehouse           19.0
ResourceUdacity            257.0
ResourceUdemy              320.0
ResourceW3Schools            5.0
Length: 75, dtype: object
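
Because <code>true_values</code> and <code>false_values</code> are among the keyword arguments shared with <code>read_csv()</code>, their effect can be sketched on in-memory CSV text without needing a workbook. This is a stand-in under that assumption, not a reproduction of the survey load above; the column values and the name <code>flags</code> are made up.

```python
import io

import pandas as pd

# Made-up survey extract held in memory as CSV text
text = "AttendedBootcamp,City\nYes,Lagos\nNo,Lima\nYes,Oslo\n"

# Cast the Yes/No column to bool, mapping the custom strings explicitly
flags = pd.read_csv(io.StringIO(text),
                    dtype={"AttendedBootcamp": bool},
                    true_values=["Yes"],
                    false_values=["No"])

# Summing a Boolean column counts the True values
print(flags["AttendedBootcamp"].sum())
```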


