# Census Case Study

- Preparing SQLAlchemy and the Database
- Loading Data into the Database
- Solving Data Science Problems with Queries

## Part 1: Preparing SQLAlchemy and the Database
- Create an Engine and MetaData object

In [1]:
from sqlalchemy import create_engine, MetaData

engine = create_engine('sqlite:///census.sqlite')
metadata = MetaData()

## Part 1: Preparing SQLAlchemy and the Database
- Create and save a table (the example uses an `employees` table; the `census` table is built the same way)

In [2]:
from sqlalchemy import (Table, Column, String, Integer, Float, Boolean)

employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255)),
                  Column('active', Boolean()))

metadata.create_all(engine)

---
# Let’s practice!

In [3]:
# Import create_engine, MetaData
from sqlalchemy import create_engine, MetaData

# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///chapter5.sqlite')
connection = engine.connect()
# Initialize MetaData: metadata
metadata = MetaData()

In [4]:
# Import Table, Column, String, and Integer
from sqlalchemy import Table, Column, String, Integer

# Build a census table: census
census = Table('census', metadata,
               Column('state', String(30)),
               Column('sex', String(1)),
               Column('age', Integer()),
               Column('pop2000', Integer()),
               Column('pop2008', Integer()))

# Create the table in the database
metadata.create_all(engine)
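
To confirm that `create_all()` really produced the table, the database can be inspected afterwards. This is a minimal sketch, assuming an in-memory SQLite database in place of `chapter5.sqlite`; the `demo_` names are hypothetical and not part of the course material.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, inspect)

# Assumption: an in-memory SQLite database stands in for chapter5.sqlite
demo_engine = create_engine('sqlite://')
demo_metadata = MetaData()

# Same column layout as the census table defined above
demo_census = Table('census', demo_metadata,
                    Column('state', String(30)),
                    Column('sex', String(1)),
                    Column('age', Integer()),
                    Column('pop2000', Integer()),
                    Column('pop2008', Integer()))

# create_all() emits CREATE TABLE for every table registered on the metadata
demo_metadata.create_all(demo_engine)

# The inspector reads the schema back from the database itself
inspector = inspect(demo_engine)
table_names = inspector.get_table_names()
column_names = [col['name'] for col in inspector.get_columns('census')]
print(table_names)
print(column_names)
```

Reading the schema back from the database, rather than from `demo_metadata`, checks what was actually created.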

# Populating the Database

## Part 2: Populating the Database
- Load a CSV file into a values list

```python
values_list = []
for row in csv_reader:
    data = {'state': row[0], 'sex': row[1],
            'age': row[2], 'pop2000': row[3],
            'pop2008': row[4]}
    values_list.append(data)
```

## Part 2: Populating the Database
- Insert the values list into the census table

```python
from sqlalchemy import insert

# Build the insert statement for the census table
stmt = insert(census)
# Passing a list of dictionaries inserts all rows in one call
result_proxy = connection.execute(stmt, values_list)
# Print the number of rows inserted
print(result_proxy.rowcount)
```

---
# Let’s practice!

In [5]:
import csv

ifile = open('census.csv', 'r')
csv_reader = csv.reader(ifile)

In [6]:
# Create an empty list: values_list
values_list = []

# Iterate over the rows
for row in csv_reader:
    # Create a dictionary with the values
    data = {'state': row[0], 'sex': row[1], 'age': row[2], 'pop2000': row[3],
            'pop2008': row[4]}
    # Append the dictionary to the values list
    values_list.append(data)
    
ifile.close()
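
The `csv` module returns every field as a string, so the numeric columns arrive as text. The sketch below shows one way to convert them while building the list. It assumes the column order used above (state, sex, age, pop2000, pop2008); the two sample rows are made up for illustration and do not come from `census.csv`.

```python
import csv
import io

# Hypothetical sample rows in the same column order as census.csv;
# the numbers are invented for this example.
sample_csv = io.StringIO(
    "Illinois,M,0,100,110\n"
    "Illinois,F,0,95,105\n"
)

fieldnames = ['state', 'sex', 'age', 'pop2000', 'pop2008']
sample_values = []

# With a real file, use: with open('census.csv', newline='') as ifile:
# so the file is closed automatically when the block ends.
for row in csv.reader(sample_csv):
    # Pair each field name with the value in the same position
    data = dict(zip(fieldnames, row))
    # Convert the numeric fields from str to int
    for key in ('age', 'pop2000', 'pop2008'):
        data[key] = int(data[key])
    sample_values.append(data)

print(sample_values[0])
```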

In [8]:
# Import insert
from sqlalchemy import insert

# Build insert statement: stmt
stmt = insert(census)

# Use values_list to insert data: results
results = connection.execute(stmt, values_list)

# Print rowcount
print(results.rowcount)

8772
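
A bulk insert can also be checked by counting the stored rows. The following is a sketch under stated assumptions: an in-memory SQLite database, three invented rows, and `engine.begin()` to wrap the insert in a transaction that commits when the block exits (which works on SQLAlchemy 1.4 and 2.0). The `demo_` names are hypothetical.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, insert, select)

# Assumption: in-memory database and a reduced census table for the demo
demo_engine = create_engine('sqlite://')
demo_metadata = MetaData()
demo_census = Table('census', demo_metadata,
                    Column('state', String(30)),
                    Column('sex', String(1)),
                    Column('age', Integer()),
                    Column('pop2000', Integer()),
                    Column('pop2008', Integer()))
demo_metadata.create_all(demo_engine)

# Invented rows, shaped like the dictionaries in values_list
demo_rows = [
    {'state': 'Illinois', 'sex': 'M', 'age': 0, 'pop2000': 100, 'pop2008': 110},
    {'state': 'Illinois', 'sex': 'F', 'age': 0, 'pop2000': 95, 'pop2008': 105},
    {'state': 'Ohio', 'sex': 'M', 'age': 1, 'pop2000': 80, 'pop2008': 85},
]

# engine.begin() opens a transaction and commits it on successful exit
with demo_engine.begin() as conn:
    result = conn.execute(insert(demo_census), demo_rows)
    reported = result.rowcount  # row count as reported by the driver

# Count the rows that are actually stored
with demo_engine.connect() as conn:
    stored = conn.execute(
        select(func.count()).select_from(demo_census)).scalar()

print(reported, stored)
```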


---
# Querying the Database

## Part 3: Answering Data Science Questions with Queries
- Determine Average Age for Males and Females


```python
from sqlalchemy import select, func

# Build a select statement that calculates the weighted average age:
# sum the product of age and population for every row,
# then divide that by the sum of the total population,
# and label the result as average_age
stmt = select([census.columns.sex,
               (func.sum(census.columns.pop2008 *
                         census.columns.age) /
                func.sum(census.columns.pop2008)
                ).label('average_age')])

# Group by the sex column to determine the average age of each sex
stmt = stmt.group_by(census.columns.sex)

# Execute the query and fetch all the results
results = connection.execute(stmt).fetchall()
```
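
The weighted average is easy to check by hand on a few rows. The sketch below uses invented data and the positional `select(...)` form, which SQLAlchemy 1.4 and 2.0 both accept (the list form `select([...])` used in these notes is the 1.x style). The explicit `cast` to `Float` is an addition of this sketch: it avoids integer division, whose handling differs between SQLAlchemy versions.

```python
from sqlalchemy import (Column, Float, Integer, MetaData, String, Table,
                        cast, create_engine, func, insert, select)

# Assumption: in-memory database with only the columns the query needs
demo_engine = create_engine('sqlite://')
demo_metadata = MetaData()
demo_census = Table('census', demo_metadata,
                    Column('sex', String(1)),
                    Column('age', Integer()),
                    Column('pop2008', Integer()))
demo_metadata.create_all(demo_engine)

# Invented rows: M -> (10*100 + 20*300) / 400 = 17.5
#                F -> (30*200 + 40*200) / 400 = 35.0
with demo_engine.begin() as conn:
    conn.execute(insert(demo_census), [
        {'sex': 'M', 'age': 10, 'pop2008': 100},
        {'sex': 'M', 'age': 20, 'pop2008': 300},
        {'sex': 'F', 'age': 30, 'pop2008': 200},
        {'sex': 'F', 'age': 40, 'pop2008': 200},
    ])

c = demo_census.columns
stmt = (
    select(
        c.sex,
        # Sum of age * population divided by total population
        (func.sum(c.pop2008 * c.age) /
         cast(func.sum(c.pop2008), Float)).label('average_age'))
    .group_by(c.sex)
    .order_by(c.sex)
)

with demo_engine.connect() as conn:
    averages = {row.sex: row.average_age for row in conn.execute(stmt)}

print(averages)
```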

## Part 3: Answering Data Science Questions with Queries
- Determine the percentage of Females for each state

Calculate the percentage by using case and cast clauses. The example below applies the technique to New York's share of the total 2008 population.

```python
from sqlalchemy import case, cast, Float, func, select
'''
Build a select statement that sums the 2008 population
only in cases where the state is New York.
Then divide it by the sum of the total 2008 population,
cast to a Float so we get a decimal value.
Multiply by 100 to get a percentage
and label the result as ny_percent.
'''
stmt = select([
    (func.sum(
        case([
            (census.columns.state == 'New York',
             census.columns.pop2008)
        ], else_=0)) /
     cast(func.sum(census.columns.pop2008),
          Float) * 100).label('ny_percent')])
```
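
The same case and cast pattern can be tried on a handful of rows. This is a sketch with invented numbers, written with positional `case((condition, value), else_=0)` and `select(...)`, the forms accepted by SQLAlchemy 1.4 and 2.0; the list forms shown in these notes are the 1.x style. The `demo_` names are hypothetical.

```python
from sqlalchemy import (Column, Float, Integer, MetaData, String, Table,
                        case, cast, create_engine, func, insert, select)

# Assumption: in-memory database with only the columns the query needs
demo_engine = create_engine('sqlite://')
demo_metadata = MetaData()
demo_census = Table('census', demo_metadata,
                    Column('state', String(30)),
                    Column('pop2008', Integer()))
demo_metadata.create_all(demo_engine)

# Invented rows: New York totals 250 out of 1000, i.e. 25 percent
with demo_engine.begin() as conn:
    conn.execute(insert(demo_census), [
        {'state': 'New York', 'pop2008': 100},
        {'state': 'New York', 'pop2008': 150},
        {'state': 'Texas', 'pop2008': 350},
        {'state': 'Texas', 'pop2008': 400},
    ])

c = demo_census.columns
stmt = select(
    # case() contributes pop2008 for New York rows and 0 for all others
    (func.sum(case((c.state == 'New York', c.pop2008), else_=0)) /
     # cast() makes the denominator a float, so the division is not truncated
     cast(func.sum(c.pop2008), Float) * 100).label('ny_percent'))

with demo_engine.connect() as conn:
    ny_percent = conn.execute(stmt).scalar()

print(ny_percent)
```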


## Part 3: Answering Data Science Questions with Queries
- Determine the top 5 states by population change from 2000 to 2008


Calculate the difference between two columns grouped by another column.
```python
'''
Build a select statement that selects the column we want to determine the change by (age).
Then calculate the difference in population between 2008 and 2000
and label the result as pop_change.
Make sure to wrap the difference in parentheses so it can be labeled.
'''

stmt = select([census.columns.age,
               (census.columns.pop2008-census.columns.pop2000).label('pop_change')])


# Order it by pop_change
stmt = stmt.order_by('pop_change')

# Limit it to just 5 results
stmt = stmt.limit(5)
```

---
# Let’s practice!

```python
# Import select and func
from sqlalchemy import select, func

# Calculate weighted average age: stmt
stmt = select([census.columns.sex,
               (func.sum(census.columns.pop2008 * census.columns.age) /
                func.sum(census.columns.pop2008)).label('average_age')
               ])

# Group by sex
stmt = stmt.group_by(census.columns.sex)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Print the average age by sex
for result in results:
    print(result.sex, result.average_age)
```

```python
# Import case, cast, Float and func from sqlalchemy
from sqlalchemy import case, cast, Float, func

# Build a query to calculate the percentage of females in 2000: stmt
stmt = select([census.columns.state,
    (func.sum(
        case([
            (census.columns.sex == 'F', census.columns.pop2000)
        ], else_=0)) /
     cast(func.sum(census.columns.pop2000), Float) * 100).label('percent_female')
])

# Group By state
stmt = stmt.group_by(census.columns.state)

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Print the percentage
for result in results:
    print(result.state, result.percent_female)
```


```python
# Import desc for descending order
from sqlalchemy import desc

# Build query to return state name and population difference from 2008 to 2000
stmt = select([census.columns.state,
     (census.columns.pop2008-census.columns.pop2000).label('pop_change')
])

# Group by State
stmt = stmt.group_by(census.columns.state)

# Order by Population Change
stmt = stmt.order_by(desc('pop_change'))

# Limit to top 10
stmt = stmt.limit(10)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))
```
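
One caveat on the last query: it groups by state but selects an unaggregated difference, so, as far as I can tell, SQLite returns the difference from a single row of each group rather than a state total. If the goal is the change in each state's whole population, a variant that sums the per-row differences may be closer to the intent. The sketch below is such a variant on invented data, using the SQLAlchemy 1.4/2.0 `select(...)` form; it is an assumption about the intent, not the course's solution.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, desc, func, insert, select)

# Assumption: in-memory database with only the columns the query needs
demo_engine = create_engine('sqlite://')
demo_metadata = MetaData()
demo_census = Table('census', demo_metadata,
                    Column('state', String(30)),
                    Column('pop2000', Integer()),
                    Column('pop2008', Integer()))
demo_metadata.create_all(demo_engine)

# Invented rows: Texas +50 +60 = 110, Ohio +10 +5 = 15, Utah -10
with demo_engine.begin() as conn:
    conn.execute(insert(demo_census), [
        {'state': 'Texas', 'pop2000': 100, 'pop2008': 150},
        {'state': 'Texas', 'pop2000': 200, 'pop2008': 260},
        {'state': 'Ohio', 'pop2000': 300, 'pop2008': 310},
        {'state': 'Ohio', 'pop2000': 100, 'pop2008': 105},
        {'state': 'Utah', 'pop2000': 50, 'pop2008': 40},
    ])

c = demo_census.columns
stmt = (
    select(
        c.state,
        # Sum the per-row differences so each state gets one total
        func.sum(c.pop2008 - c.pop2000).label('pop_change'))
    .group_by(c.state)
    .order_by(desc('pop_change'))
    .limit(2)
)

with demo_engine.connect() as conn:
    top = [(row.state, row.pop_change) for row in conn.execute(stmt)]

for state, change in top:
    print('{}:{}'.format(state, change))
```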