<h1>Loading CSV and SQL Data</h1><p><img src="images/1line.png" width="100%" /></p>
<ul>
<li>Have a look at this data frame, called `brics`.</li>
<li>It contains some basic information on the so-called BRICS countries: Brazil, Russia, India, China and South Africa.</li>
</ul>
<p><img style="display: block; margin-left: auto; margin-right: auto;" src="images/PandasDataFrame_brics.PNG" alt="Pandas DataFrame brics" width="537" height="300" /></p>
<ul>
<li>A dataframe is a table: the rows represent different entries or observations, which are countries in this case.</li>
<li>Each row has a unique row label: BR for Brazil, RU for Russia and so on.</li>
<li>The columns represent different properties, and are identified by their column labels:
<ul>
<li>country, population, area and capital.</li>
</ul>
</li>
<li>The columns can have different types.</li>
</ul>
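<p>As a rough illustration of the points above, the sketch below builds a small DataFrame by hand and prints its row labels, column labels and column types. The variable name <code>brics_example</code> is made up for this example, and the values are copied from brics.txt as given in these notes.</p>

```python
import pandas as pd

# Values copied from brics.txt as given in these notes
data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "population": [200, 144, 1252, 1357, 55],
    "area": [8515767, 17098242, 3287590, 9596961, 1221037],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretori"],
}

# brics_example is a name chosen for this sketch only
brics_example = pd.DataFrame(data, index=["BR", "RU", "IN", "CH", "SA"])

print(brics_example.index.tolist())    # row labels
print(brics_example.columns.tolist())  # column labels
print(brics_example.dtypes)            # each column has its own type
```

<p>The text columns and the numeric columns report different types, which is what "the columns can have different types" refers to.</p>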
<hr />
<h3>Building a DataFrame</h3>
<ul>
<li>You typically import data from an external file to build a data frame: e.g. <a title="brics.txt" href="data/brics.txt" >brics.txt</a></li>
</ul>
<p><strong>brics.txt</strong></p>
<pre>,country,population,area,capital <br />BR,Brazil,200,8515767,Brasilia<br />RU,Russia,144,17098242,Moscow <br />IN,India,1252,3287590,New Delhi<br />CH,China,1357,9596961,Beijing <br />SA,South Africa,55,1221037,Pretori</pre>
<ul>
<li>The first line contains the column names and the other lines are the rows of the table.</li>
<li>The code below imports the file into a Pandas DataFrame.</li>
</ul>



In [6]:
import pandas as pd
# the data file is in the subdirectory "data"
brics = pd.read_csv("data/brics.txt")
display(brics)

,Unnamed: 0,country,population,area,capital
0,BR,Brazil,200,8515767,Brasilia
1,RU,Russia,144,17098242,Moscow
2,IN,India,1252,3287590,New Delhi
3,CH,China,1357,9596961,Beijing
4,SA,South Africa,55,1221037,Pretori


<ul>
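<li>In the output above, the row labels ended up in an ordinary column named 'Unnamed: 0' and Pandas generated its own numeric index. To fix this, tell the 'read_csv()' function that the first column contains the row labels by passing index_col=0.</li>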
</ul>

In [8]:
brics = pd.read_csv("data/brics.txt", index_col=0)
display(brics)

,country,population,area,capital
BR,Brazil,200,8515767,Brasilia
RU,Russia,144,17098242,Moscow
IN,India,1252,3287590,New Delhi
CH,China,1357,9596961,Beijing
SA,South Africa,55,1221037,Pretori


<ul>
<li>Pandas provides the 'read_csv()' function, which imports a CSV file directly into a DataFrame.</li>
<li>If you do not specify the column that contains the "keys" for each row, Pandas will autogenerate a unique index for each row.</li>
</ul>

In [9]:
brics.index

Index(['BR', 'RU', 'IN', 'CH', 'SA'], dtype='object')
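<p>To see both behaviours side by side, the sketch below reads the same text twice, once without and once with <code>index_col=0</code>. It keeps the CSV text in memory with <code>io.StringIO</code> so it runs without the data file; the variable names are chosen for this example only.</p>

```python
import io
import pandas as pd

# Same content as brics.txt, held in memory so no file is needed
csv_text = """,country,population,area,capital
BR,Brazil,200,8515767,Brasilia
RU,Russia,144,17098242,Moscow
IN,India,1252,3287590,New Delhi
CH,China,1357,9596961,Beijing
SA,South Africa,55,1221037,Pretori
"""

# Without index_col: pandas generates the integer labels 0 to 4
default_index = pd.read_csv(io.StringIO(csv_text))
print(default_index.index.tolist())
print(default_index.columns[0])  # the unnamed first column is read as data

# With index_col=0: the first column supplies the row labels
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(labeled.index.tolist())
print(labeled.loc["IN", "capital"])  # look up a value by row label
```

<p>With labelled rows, <code>.loc</code> can select a row by its key instead of its position.</p>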

<h3>Read SQL</h3>
<ul>
<li>You can also read the data for a Pandas DataFrame directly from a SQL query using read_sql_query.</li>
<li>Just like with read_csv, if you do not specify the column that contains the "keys" for each row, Pandas will autogenerate a unique index for each row.</li>
<li>You can specify an index_col using the name of the field in the database table (the example below assumes the first field in the brics table is named field1).</li>
<li><a class="instructure_file_link inline_disabled" title="brics.sqlite" href="data/brics.sqlite" target="_blank" >brics.sqlite</a></li>
</ul>

In [10]:
import pandas as pd
import sqlite3

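# open the SQLite database file in the subdirectory "data"
conn = sqlite3.connect('data/brics.sqlite')

# run the query and use the field1 column as the row index
brics = pd.read_sql_query("SELECT * FROM brics", con=conn, index_col="field1")

# the query result is now in memory, so the connection can be closed
conn.close()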

<ul>
<li>This time `brics` is a DataFrame with both row labels (taken from field1) and column labels.</li>
</ul>

In [2]:
display(brics)

field1,country,population,area,capital
BR,Brazil,200,8515767,Brasilia
RU,Russia,144,17098242,Moscow
IN,India,1252,3287590,New Delhi
CH,China,1357,9596961,Beijing
SA,South Africa,55,1221037,Pretori
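<p>The following sketch may help when brics.sqlite is not at hand. It builds an in-memory SQLite database and reads it back with and without <code>index_col</code>. The table layout, including the field name <code>field1</code> and the column types, is an assumption based on the text above and not the actual schema of the course file.</p>

```python
import sqlite3
import pandas as pd

# In-memory database so the example runs without brics.sqlite.
# The schema (field1 plus four columns) is assumed, not taken from the real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE brics ("
    "field1 TEXT, country TEXT, population INTEGER, area INTEGER, capital TEXT)"
)
rows = [
    ("BR", "Brazil", 200, 8515767, "Brasilia"),
    ("RU", "Russia", 144, 17098242, "Moscow"),
    ("IN", "India", 1252, 3287590, "New Delhi"),
    ("CH", "China", 1357, 9596961, "Beijing"),
    ("SA", "South Africa", 55, 1221037, "Pretori"),
]
conn.executemany("INSERT INTO brics VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# No index_col: pandas generates an integer index and field1 stays a column
plain = pd.read_sql_query("SELECT * FROM brics", con=conn)

# index_col="field1": the field1 values become the row labels
indexed = pd.read_sql_query("SELECT * FROM brics", con=conn, index_col="field1")
conn.close()

print(plain.columns.tolist())
print(indexed.index.tolist())
```

<p>Unlike the CSV case, the index here keeps a name (<code>field1</code>), because the database column has one.</p>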


<h3>Adding a Column</h3>
<ul>
<li>To add a column, place the new column name in square brackets and assign a list to it. The list needs one value per row.</li>
<li>Note: You can also read the column from a second CSV file.</li>
</ul>

In [3]:
brics["on_earth"] = [True, True, True, True, True]
display(brics)

field1,country,population,area,capital,on_earth
BR,Brazil,200,8515767,Brasilia,True
RU,Russia,144,17098242,Moscow,True
IN,India,1252,3287590,New Delhi,True
CH,China,1357,9596961,Beijing,True
SA,South Africa,55,1221037,Pretori,True
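<p>The note about reading a column from a second CSV file could look roughly like the sketch below. The second file and its <code>currency</code> column are hypothetical and not part of the course data; the text is kept in memory so the example runs on its own.</p>

```python
import io
import pandas as pd

# Small stand-in for the brics DataFrame
brics_example = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Hypothetical second file (not part of the course data).
# Its rows are deliberately in a different order.
extra_text = """,currency
SA,Rand
BR,Real
CH,Renminbi
IN,Rupee
RU,Ruble
"""
extra = pd.read_csv(io.StringIO(extra_text), index_col=0)

# Assigning a Series matches values by row label, not by position
brics_example["currency"] = extra["currency"]
print(brics_example)
```

<p>Because both tables share the same row labels, the order of the rows in the second file does not matter. A plain list, by contrast, is assigned strictly by position.</p>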


<ul>
<li>You can even make a calculated column, based on other columns.</li>
<li>Here we make a column with the population density.</li>
<li>In pandas we can simply divide the population by the area (you cannot do this with regular Python lists).</li>
<li>We multiply by one million because the population column is expressed in millions.</li>
</ul>

In [4]:
brics["density"] = brics["population"] / brics["area"] * 1000000
display(brics)

field1,country,population,area,capital,on_earth,density
BR,Brazil,200,8515767,Brasilia,True,23.485847
RU,Russia,144,17098242,Moscow,True,8.421918
IN,India,1252,3287590,New Delhi,True,380.826076
CH,China,1357,9596961,Beijing,True,141.398928
SA,South Africa,55,1221037,Pretori,True,45.04368
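<p>To make the contrast with regular Python lists concrete, here is a small sketch that tries the same division on lists and on pandas Series. The variable names are chosen for this example; the numbers are the population and area values from the table above.</p>

```python
import pandas as pd

population = [200, 144, 1252, 1357, 55]  # in millions
area = [8515767, 17098242, 3287590, 9596961, 1221037]

# Plain lists do not support element-wise division
try:
    population / area
except TypeError as err:
    print("lists:", err)

# With lists you need an explicit loop or comprehension
density_list = [p / a * 1000000 for p, a in zip(population, area)]

# pandas Series divide element by element
density_series = pd.Series(population) / pd.Series(area) * 1000000

print(density_list)
print(density_series.tolist())
```

<p>Both approaches give the same numbers; the pandas version simply expresses the calculation on whole columns at once.</p>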


<hr><h3>References</h3>
<p>Learn Python.org, Pandas Basics, <a href="https://www.learnpython.org/en/Pandas_Basics" target="_blank" rel="noopener">https://www.learnpython.org/en/Pandas_Basics</a></p>