# Pandas Fundamentals

In [1]:
import numpy as np
import pandas as pd

## Series

In [2]:
aa = [1,2,3,4]
ab = (5,6,7,8)
ac = {9,10,11,12}
ad = {'first':13,'second':14,'third':15,'fourth':16}
ser1 = pd.Series(aa)
ser2 = pd.Series(ab)
ser3 = pd.Series(list(ac))  # a set is unordered, so convert it to a list first
ser4 = pd.Series(ad)
print(ser1)
print(ser2)
print(ser3)
print(ser4)

0    1
1    2
2    3
3    4
dtype: int64
0    5
1    6
2    7
3    8
dtype: int64
0     9
1    10
2    11
3    12
dtype: int64
first     13
second    14
third     15
fourth    16
dtype: int64


In [3]:
ser5 = pd.Series(17,18,19,20)
ser5
# Running this code raises: TypeError: Index(...) must be called with a collection of some kind, 18 was passed
# The values must be passed as a single collection, such as a list [], a tuple (), or a dict {key: value}; here 18 was read as the index argument

TypeError: Index(...) must be called with a collection of some kind, 18 was passed
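
The error comes from how the positional arguments are read: the first is `data` and the second is `index`, so `18` is treated as the index. A minimal sketch of one way to fix it is shown here; the names `ser5_fixed` and `ser5_labeled` and the labels `'a'` to `'d'` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Wrap the values in one collection so they are all passed as `data`.
ser5_fixed = pd.Series([17, 18, 19, 20])

# The second parameter of pd.Series is `index`, so custom labels go there.
ser5_labeled = pd.Series([17, 18, 19, 20], index=['a', 'b', 'c', 'd'])

print(ser5_fixed)
print(ser5_labeled)
```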

## DataFrames from Series

In [4]:
df1 = pd.DataFrame(ser1)
df1

Unnamed: 0,0
0,1
1,2
2,3
3,4


In [5]:
df1.insert(1, 1, ser2)
df1.insert(2, 2, ser3)
df1

Unnamed: 0,0,1,2
0,1,5,9
1,2,6,10
2,3,7,11
3,4,8,12


In [6]:
df1.insert(3, 3, ser4)
df1

Unnamed: 0,0,1,2,3
0,1,5,9,
1,2,6,10,
2,3,7,11,
3,4,8,12,


When ser4 is added as a new column, every value comes back as NaN. pandas aligns an inserted Series with the DataFrame on index labels, and because ser4 was created from a dictionary, its index is the dictionary keys ('first', 'second', 'third', 'fourth'). None of those labels match the existing row labels of df1 (0 to 3), so no values line up.
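
The alignment behaviour can be reproduced on a small frame. The sketch below is one possible workaround, not the only one: it assumes the goal is to place the values by position, and the column names `aligned` and `by_position` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 4]})  # default row labels 0..3
ser = pd.Series({'first': 13, 'second': 14, 'third': 15, 'fourth': 16})

# Inserted as a Series: values are matched on index labels.
# None of 'first'..'fourth' exist in df.index, so the column is all NaN.
df.insert(1, 'aligned', ser)

# Inserted as a plain array: no labels to align, so values go in by position.
df.insert(2, 'by_position', ser.to_numpy())

print(df)
```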

In [7]:
df1.loc[4] = [5,9,12,0]
df1

Unnamed: 0,0,1,2,3
0,1,5,9,
1,2,6,10,
2,3,7,11,
3,4,8,12,
4,5,9,12,0.0


## Reading Data into a DataFrame

In [11]:
# Import sakila dataset .csv
df2 = pd.read_csv('sakila_example.csv')
df2.head()

Unnamed: 0,title,description,category,language,actor_name
0,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,Documentary,English,PENELOPE GUINESS
1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,Documentary,English,CHRISTIAN GABLE
2,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,Documentary,English,LUCILLE TRACY
3,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,Documentary,English,SANDRA PECK
4,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,Documentary,English,JOHNNY CAGE


In [12]:
df2.tail(10)

Unnamed: 0,title,description,category,language,actor_name
252,ATTACKS HATE,A Fast-Paced Panorama of a Technical Writer An...,Sci-Fi,English,MILLA KEITEL
253,ATTACKS HATE,A Fast-Paced Panorama of a Technical Writer An...,Sci-Fi,English,GROUCHO DUNST
254,ATTACKS HATE,A Fast-Paced Panorama of a Technical Writer An...,Sci-Fi,English,BURT TEMPLE
255,ATTRACTION NEWTON,A Astounding Panorama of a Composer And a Fris...,New,English,UMA WOOD
256,ATTRACTION NEWTON,A Astounding Panorama of a Composer And a Fris...,New,English,RIP WINSLET
257,ATTRACTION NEWTON,A Astounding Panorama of a Composer And a Fris...,New,English,GARY PENN
258,ATTRACTION NEWTON,A Astounding Panorama of a Composer And a Fris...,New,English,CHRISTOPHER WEST
259,AUTUMN CROW,A Beautiful Tale of a Dentist And a Mad Cow wh...,Games,English,DUSTIN TAUTOU
260,AUTUMN CROW,A Beautiful Tale of a Dentist And a Mad Cow wh...,Games,English,ANGELA HUDSON
261,AUTUMN CROW,A Beautiful Tale of a Dentist And a Mad Cow wh...,Games,English,JAMES PITT


In [15]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 262 entries, 0 to 261
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        262 non-null    object
 1   description  262 non-null    object
 2   category     262 non-null    object
 3   language     262 non-null    object
 4   actor_name   262 non-null    object
dtypes: object(5)
memory usage: 10.4+ KB


Based on the output of the .info() method, this dataset has 262 entries and 5 columns: title, description, category, language, and actor_name. Every column has the 'object' dtype, which here means strings, and none of the columns contain null values.
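
The same facts can also be read individually rather than from the printed summary. This is a small sketch on a made-up frame (`sample` is a hypothetical stand-in for df2, which is loaded from sakila_example.csv):

```python
import pandas as pd

# Hypothetical stand-in for df2 with two of its columns.
sample = pd.DataFrame({
    'title': ['ACADEMY DINOSAUR', 'ATTACKS HATE', 'AUTUMN CROW'],
    'category': ['Documentary', 'Sci-Fi', 'Games'],
})

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # data type of each column
print(sample.isna().sum())  # number of null values in each column
```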

In [16]:
df2.describe()

Unnamed: 0,title,description,category,language,actor_name
count,262,262,262,262,262
unique,46,46,15,1,145
top,ARABIA DOGMA,A Touching Epistle of a Madman And a Mad Cow w...,Horror,English,OPRAH KILMER
freq,12,12,55,262,8


Based on the output of the .describe() method, I believe the most useful rows are top and freq: top shows the most frequent value in each column, and freq shows how many times that value appears. unique is somewhat less useful but still informative, since it gives the number of distinct values in each column. count adds little here because the .info() output in the previous cell already reported it.
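
For a single text column, the top, freq, and unique figures can be obtained directly with `value_counts()` and `nunique()`. The sketch below uses a short made-up Series (`category` is a hypothetical stand-in, not the real df2 column):

```python
import pandas as pd

# Hypothetical stand-in for one text column.
category = pd.Series(['Horror', 'Games', 'Horror', 'Sci-Fi', 'Horror'],
                     name='category')

# value_counts() lists each distinct value with its frequency, most common first.
counts = category.value_counts()

top = counts.index[0]        # most frequent value, like `top`
freq = counts.iloc[0]        # how often it appears, like `freq`
unique = category.nunique()  # number of distinct values, like `unique`

print(top, freq, unique)
```

Unlike describe(), value_counts() shows the full distribution, which helps when several values are almost as frequent as the top one.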

In [17]:
# Import northwind dataset .csv
df3 = pd.read_csv('northwind_example.csv')
df3

Unnamed: 0,OrderID,ProductName,UnitPrice,Quantity
0,10248,Queso Cabrales,14.0,12
1,10248,Singaporean Hokkien Fried Mee,9.8,10
2,10248,Mozzarella di Giovanni,34.8,5
3,10249,Manjimup Dried Apples,42.4,40
4,10249,Tofu,18.6,9
...,...,...,...,...
495,10435,Mozzarella di Giovanni,27.8,10
496,10436,Gnocchi di nonna Alice,30.4,40
497,10436,Wimmers gute Semmelknödel,26.6,30
498,10436,Rhönbräu Klosterbier,6.2,24


In [18]:
# Show column statistics on df3
df3.describe()

Unnamed: 0,OrderID,UnitPrice,Quantity
count,500.0,500.0,500.0
mean,10341.364,23.3714,24.33
std,54.442663,27.334436,18.069702
min,10248.0,2.0,1.0
25%,10294.0,10.4,10.0
50%,10341.0,15.8,20.0
75%,10389.0,27.8,32.0
max,10436.0,210.8,120.0


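Based on the output of the .describe() method, the 500 rows have a mean UnitPrice of about 23.37 (median 15.8, maximum 210.8) and a mean Quantity of 24.33 (median 20, maximum 120). In both columns the mean is above the median, which suggests a few large values pull the average up. The OrderID statistics are not meaningful, since OrderID is an identifier rather than a measurement.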
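
For numeric columns, describe() reports count, mean, std, min, the quartiles, and max. One simple way to read them is to compare the mean with the median (the 50% row). The sketch below uses made-up prices, not values from northwind_example.csv:

```python
import pandas as pd

# Hypothetical prices chosen so that one value is much larger than the rest.
prices = pd.Series([2.0, 10.0, 15.0, 20.0, 200.0], name='UnitPrice')

stats = prices.describe()  # Series indexed by 'count', 'mean', 'std', ...

mean = stats['mean']
median = stats['50%']

# A mean well above the median hints that a few large values skew the column.
print(mean, median, mean > median)
```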

## Coding a DataFrame

In [19]:
test_data = {
    'ID':[1,2,3,4,5,6,7,8],
    'Name':['Alicia','Bob','Charlie','Dominic','Eve','Frank','Grace','Heidi'],
    'Age':[25,np.nan,22,19,31,35,np.nan,18],
    'City':['NY','LA','Chicago','Houston','Phoenix','Boston','Austin','San Diego'],
    'Score':[85,92,-1,65,78,np.nan,55,90]
}

In [20]:
df4 = pd.DataFrame(test_data)
df4

Unnamed: 0,ID,Name,Age,City,Score
0,1,Alicia,25.0,NY,85.0
1,2,Bob,,LA,92.0
2,3,Charlie,22.0,Chicago,-1.0
3,4,Dominic,19.0,Houston,65.0
4,5,Eve,31.0,Phoenix,78.0
5,6,Frank,35.0,Boston,
6,7,Grace,,Austin,55.0
7,8,Heidi,18.0,San Diego,90.0


In [21]:
df4.to_csv('test_data.csv', index=False)
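
To check what to_csv writes without creating a file, the output can be sent to an in-memory buffer and read back. This is a rough sketch on a smaller made-up frame (`df` and `restored` are illustrative names); it assumes the default settings, under which missing values are written as empty fields.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2], 'Age': [25, np.nan]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row labels
buffer.seek(0)                  # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)  # empty fields come back as NaN
print(restored)
```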