In [None]:
# %pip install numpy pandas sqlalchemy

In [1]:
import numpy as np
import pandas as pd

## Regularizing, Splitting Text Data

Oftentimes, string data contains multiple pieces of data inside it, split with a separator character.  With the pandas `Series.str.split()` method, you can turn a DataFrame from this:

| line |
| :--: |
| hi_1 |
| bye_2|

into this:

| line | msg | num |
| :--: | :--: | :--: |
| hi_1 | hi | 1 |
| bye_2| bye | 2 |

using a single line:

```python
df[['msg', 'num']] = df['line'].str.split('_', expand=True)
```
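
To see what `expand=True` is doing, here is a minimal sketch on the toy table above. The variable names `as_lists` and `as_columns` are just illustrative, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'line': ['hi_1', 'bye_2']})

# Without expand=True, each row holds a Python list of the pieces.
as_lists = df['line'].str.split('_')

# With expand=True, the pieces are spread across the columns of a new DataFrame.
as_columns = df['line'].str.split('_', expand=True)

# Assign the two new columns back onto the original DataFrame.
df[['msg', 'num']] = as_columns
```

Note that the pieces are still strings after the split, so `num` holds `'1'` and `'2'` rather than integers.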


Let's try it out!

In [2]:
df = pd.DataFrame({
    'counts_XADD': ["1;3;5", "10;2;6"],
    'intensities_JJAKX': ['5_32_654', "10_1_99"],
})
df

Unnamed: 0,counts_XADD,intensities_JJAKX
0,1;3;5,5_32_654
1,10;2;6,10_1_99


First, rename the columns `counts_XADD` and `intensities_JJAKX` to just keep the part of the names before the underscore. For that, we can use the rename function in pandas, which has the following syntax:

```python
df = df.rename(columns={'original_column_name1': 'new_column_name1',
                        'original_column_name2': 'new_column_name2'})
```

In [3]:
df = df.rename(columns={'counts_XADD': 'counts', 'intensities_JJAKX': 'intensities'})
df

Unnamed: 0,counts,intensities
0,1;3;5,5_32_654
1,10;2;6,10_1_99
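
As an alternative sketch, `rename` also accepts a function for `columns`, which avoids typing each name by hand. This assumes every column name should be cut at its first underscore; the name `renamed` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'counts_XADD': ["1;3;5", "10;2;6"],
    'intensities_JJAKX': ['5_32_654', "10_1_99"],
})

# The function is called once per column name; keep only the text before the first underscore.
renamed = df.rename(columns=lambda name: name.split('_')[0])
```

The dictionary form is safer when only some of the columns follow the pattern.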


Split the `counts` column into `Counts_1`, `Counts_2`, and `Counts_3`.

In [4]:
df[['Counts_1', 'Counts_2', 'Counts_3']] = df['counts'].str.split(';', expand=True)
del df['counts']
df

Unnamed: 0,intensities,Counts_1,Counts_2,Counts_3
0,5_32_654,1,3,5
1,10_1_99,10,2,6


Split the `intensities` column into `Intensities_1`, `Intensities_2`, and `Intensities_3`.

In [5]:
df[['Intensities_1', 'Intensities_2', 'Intensities_3']] = df['intensities'].str.split('_', expand=True)
del df['intensities']
df

Unnamed: 0,Counts_1,Counts_2,Counts_3,Intensities_1,Intensities_2,Intensities_3
0,1,3,5,5,32,654
1,10,2,6,10,1,99
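
The split columns look numeric, but `str.split` returns text. A minimal sketch of converting them, assuming every piece is a valid integer (the names `parts` and `numeric` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'counts': ["1;3;5", "10;2;6"]})

# Split into three text columns and name them.
parts = df['counts'].str.split(';', expand=True)
parts.columns = ['Counts_1', 'Counts_2', 'Counts_3']

# Convert the text pieces to integers so that arithmetic works as expected.
numeric = parts.astype(int)
```

If some pieces might not be valid numbers, `pd.to_numeric(..., errors='coerce')` on each column is a more forgiving option.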


## Out-of-core Reshaping Operations: Joins with SQL Queries

Does anyone in your group know some SQL?  You can write to and read from tables in any SQL database using the package sqlalchemy, as well as send custom queries!

| Function | Description |
| :---:    | :----:      |
| `create_engine()` | Describe how sqlalchemy should find and connect to your database |
| `engine.connect()` | Make an open connection to the database (similar to opening a file) |
| `DataFrame.to_sql("table_name", conn)` | Write to a table in a database you have an open connection to |
| `pd.read_sql_table("table_name", conn)` | Read from a table in a database you have an open connection to |
| `pd.read_sql_query("SELECT * FROM table_name", conn)` | Read the result of a query from a database you have an open connection to |
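
Here is a minimal sketch of how the functions in the table fit together. It assumes an in-memory SQLite database (`"sqlite://"`) so that nothing is written to disk, and the table name `demo_ages` is made up for this illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# "sqlite://" with no file path describes a temporary in-memory database.
engine = create_engine("sqlite://")

with engine.connect() as conn:
    # Write a small DataFrame to a new table.
    pd.DataFrame({'Name': ['Paul', 'Arash'], 'Age': [16, 19]}).to_sql("demo_ages", conn, index=False)

    # Read the whole table back.
    whole_table = pd.read_sql_table("demo_ages", conn)

    # Read only the rows and columns selected by a custom query.
    older = pd.read_sql_query("SELECT Name FROM demo_ages WHERE Age > 18", conn)
```

Everything happens inside one `with` block here because an in-memory database is only guaranteed to stay around for the connection that created it.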

In [6]:
from sqlalchemy import create_engine

In [7]:
%pip install sqlalchemy

Note: you may need to restart the kernel to use updated packages.


### Create and Populate the Database

In [11]:

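# if_exists="replace" lets this cell be re-run without a "table already exists" error.
with create_engine("sqlite:///people.db").connect() as conn:
    pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]}).to_sql("ages", conn, index=False, if_exists="replace")
    pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]}).to_sql("weights", conn, index=False, if_exists="replace")
    # No index=False here, so the DataFrame index is also stored, in a column named "index".
    pd.DataFrame({'Name': ['Amy', 'Paul', 'Sara'], 'Height': [170, 190, 143]}).to_sql("heights", conn, if_exists="replace")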

### Examples: Read from the Database

In [9]:
with create_engine("sqlite:///people.db").connect() as conn:
    df = pd.read_sql_table("ages", conn)
df

Unnamed: 0,Name,Age
0,Paul,16
1,Arash,19
2,Jenny,17


In [10]:
query = """
SELECT Age FROM ages
"""
with create_engine("sqlite:///people.db").connect() as conn:
    df = pd.read_sql_query(query, conn)
df

Unnamed: 0,Age
0,16
1,19
2,17


### Optional Exercise

What kinds of queries can we make on this data?

In [94]:
query = """
SELECT Weight FROM weights
"""
with create_engine("sqlite:///people.db").connect() as conn:
    df = pd.read_sql_query(query, conn)
df

Unnamed: 0,Weight
0,32
1,15
2,37


In [95]:
query = """
SELECT Height FROM heights
"""

with create_engine("sqlite:///people.db").connect() as conn:
    df = pd.read_sql_query(query, conn)
df

Unnamed: 0,Height
0,170
1,190
2,143


In [97]:
query = """
SELECT Name FROM heights
"""

with create_engine("sqlite:///people.db").connect() as conn:
    df = pd.read_sql_query(query, conn)

df

Unnamed: 0,Name
0,Amy
1,Paul
2,Sara
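
Since this section is about joins, here is one possible sketch of a query that combines two tables. It rebuilds the `ages` and `weights` tables in an in-memory SQLite database (an assumption made so the example runs on its own, without touching `people.db`), and the name `joined` is illustrative.

```python
import pandas as pd
from sqlalchemy import create_engine

# An INNER JOIN keeps only the names that appear in both tables.
query = """
SELECT ages.Name, ages.Age, weights.Weight
FROM ages
INNER JOIN weights ON ages.Name = weights.Name
ORDER BY ages.Name
"""

with create_engine("sqlite://").connect() as conn:
    pd.DataFrame({'Name': ['Paul', 'Arash', 'Jenny'], 'Age': [16, 19, 17]}).to_sql("ages", conn, index=False)
    pd.DataFrame({'Name': ['Arash', 'Paul', 'Sara'], 'Weight': [32, 15, 37]}).to_sql("weights", conn, index=False)
    joined = pd.read_sql_query(query, conn)
```

Only Arash and Paul appear in the result, because Jenny has no weight and Sara has no age. Swapping `INNER JOIN` for `LEFT JOIN` would keep Jenny, with a missing `Weight`.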
