[What is Pandas](https://www.youtube.com/live/zCDVUyq8lkw?si=5PU7SsP9Q7HRQxuF&t=557)

***Series and DataFrame are two prominent data structures in pandas***

[Pandas Series explained](https://www.youtube.com/live/zCDVUyq8lkw?si=otnRlAXbsMqv4rUd&t=711) 

In [170]:
import numpy as np
import pandas as pd

# creating a pandas Series; see the video for an explanation -> https://www.youtube.com/live/zCDVUyq8lkw?si=tFNlGmNIANJ6RIeO&t=897

# string type series
labels = ['a', 'b', 'c']
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [171]:
# integer type series
my_list = np.arange(100,301,100)
pd.Series(data=my_list)

0    100
1    200
2    300
dtype: int64

In [172]:
# custom index

languages = ['Python', 'JS', 'C++','C']
pd.Series(data=languages, index=['first', 'second', 'third', 'fourth'])

first     Python
second        JS
third        C++
fourth         C
dtype: object

In [173]:
# give the Series object a name with the `name` parameter; it can then be read back through `.name`

langs = pd.Series(data=languages, index=['first', 'second', 'third', 'fourth'],name='Programming Languages')
langs

first     Python
second        JS
third        C++
fourth         C
Name: Programming Languages, dtype: object

[Name your columns, if the data has no column names, with the `names` parameter of `pd.read_csv`](https://youtu.be/a_XrmKlaGTs?si=wbGmZLBu0LoU_CK5&t=727)
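
A minimal sketch of naming columns while reading a file that has no header row. The CSV text and the column labels `language` and `year` are made up for illustration; they are not from the video.

```python
from io import StringIO
import pandas as pd

# Hypothetical headerless CSV content, invented for this example
raw = StringIO("Python,1991\nC,1972\nJS,1995\n")

# header=None tells pandas the first row is data, not column labels;
# names supplies the labels to use instead
langs_df = pd.read_csv(raw, header=None, names=["language", "year"])
print(langs_df)
```

Without `header=None` and `names`, pandas would treat the first data row as the header.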

## Series From dict 

In [174]:
programming_languages = {
    'first': 'JS',
    'second': 'Python',
    'third': 'C++',
    'fourth': 'C'
}

pd.Series(programming_languages) # the keys become the index and the values become the data


first         JS
second    Python
third        C++
fourth         C
dtype: object
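
As a small extension of the dict example, the sketch below passes `index=` together with a dict. In that case pandas picks and orders the entries by key, and a key that is missing from the dict (here the invented key `'fifth'`) gets `NaN`.

```python
import pandas as pd

programming_languages = {
    'first': 'JS',
    'second': 'Python',
    'third': 'C++',
    'fourth': 'C'
}

# index= selects and orders entries by key; 'fifth' is not in the dict, so it becomes NaN
picked = pd.Series(programming_languages, index=['fourth', 'first', 'fifth'])
print(picked)
```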

## Series Attributes

In [175]:
langs 

first     Python
second        JS
third        C++
fourth         C
Name: Programming Languages, dtype: object

In [176]:
# size, number of items in the series

langs.size


4

In [177]:
# dtype

langs.dtype

dtype('O')

In [178]:
# name

langs.name

'Programming Languages'

In [179]:
# is_unique, tells if all values are unique

langs.is_unique

True

In [180]:
# index, for all index values

langs.index

Index(['first', 'second', 'third', 'fourth'], dtype='object')

In [181]:
# values, return numpy array of all values in the series
langs.values

array(['Python', 'JS', 'C++', 'C'], dtype=object)
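
The attributes above can be tried together on a Series that contains a duplicate, which makes `is_unique` return `False`. This is only a sketch; the extra `'C'` entry is added for illustration, and `to_numpy()` is shown as an alternative to `.values`.

```python
import pandas as pd

langs_dup = pd.Series(['Python', 'JS', 'C++', 'C', 'C'], name='Programming Languages')

print(langs_dup.size)        # number of items
print(langs_dup.shape)       # one-element tuple, because a Series is one-dimensional
print(langs_dup.is_unique)   # False here, because 'C' appears twice
print(langs_dup.to_numpy())  # NumPy array of the values, like .values
```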

## Series using `read_csv`

In [182]:
# with one col
import requests
from io import StringIO
import pandas as pd

# Use the raw URL, not the GitHub webpage URL
url = "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"
headers = {"User-Agent": "Mozilla/5.0"}

# Make the request
req = requests.get(url, headers=headers)

# Convert to CSV using StringIO
data = StringIO(req.text)

# Load into pandas DataFrame
df = pd.read_csv(data)

# Export the CSV to the working directory; the index is written as well,
# which is why an `Unnamed: 0` column appears when the file is read back
df.to_csv('file1.csv')
df = pd.read_csv('file1.csv')  # read_csv returns a DataFrame by default
df

,Unnamed: 0,id,location_id,address_1,address_2,city,state_province,postal_code,country
0,0,1,1,2600 Middlefield Road,,Redwood City,CA,94063,US
1,1,2,2,24 Second Avenue,,San Mateo,CA,94401,US
2,2,3,3,24 Second Avenue,,San Mateo,CA,94403,US
3,3,4,4,24 Second Avenue,,San Mateo,CA,94401,US
4,4,5,5,24 Second Avenue,,San Mateo,CA,94401,US
5,5,6,6,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
6,6,7,7,500 Arbor Road,,Menlo Park,CA,94025,US
7,7,8,8,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
8,8,9,9,2510 Middlefield Road,,Redwood City,CA,94063,US
9,9,10,10,1044 Middlefield Road,,Redwood City,CA,94063,US


In [183]:
# a single column can be converted to a Series with the `.squeeze()` method (the old `squeeze` parameter of `read_csv` was removed in pandas 2.0)

print(df['address_1'].squeeze())

print()
print(type(df['address_1'].squeeze()))

print(type(df['address_1'])) # selecting a single column with [] already returns a Series

# second way of converting to a Series: select one column by position
# pd.read_csv('file1.csv', index_col="city").iloc[:, 0] 

0          2600 Middlefield Road
1               24 Second Avenue
2               24 Second Avenue
3               24 Second Avenue
4               24 Second Avenue
5              800 Middle Avenue
6                 500 Arbor Road
7              800 Middle Avenue
8          2510 Middlefield Road
9          1044 Middlefield Road
10           2140 Euclid Avenue.
11         1044 Middlefield Road
12           399 Marine Parkway.
13            660 Veterans Blvd.
14          1500 Valencia Street
15           1161 South Bernardo
16       409 South Spruce Avenue
17              114 Fifth Avenue
18           19 West 39th Avenue
19            123 El Camino Real
20    2013 Avenue of the fellows
Name: address_1, dtype: object

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [184]:
# the benefit of giving a Series a name - https://www.youtube.com/live/zCDVUyq8lkw?si=Y02AbkaecZALGGQv&t=2307
# when a Series is taken from a DataFrame column, the column name becomes the name of the Series

In [185]:
# index_col sets the given column (here: city) as the index
# squeeze() leaves the result unchanged here because more than one column remains, so a DataFrame is returned
pd.read_csv('file1.csv', index_col="city").squeeze()

city,Unnamed: 0,id,location_id,address_1,address_2,state_province,postal_code,country
Redwood City,0,1,1,2600 Middlefield Road,,CA,94063,US
San Mateo,1,2,2,24 Second Avenue,,CA,94401,US
San Mateo,2,3,3,24 Second Avenue,,CA,94403,US
San Mateo,3,4,4,24 Second Avenue,,CA,94401,US
San Mateo,4,5,5,24 Second Avenue,,CA,94401,US
Menlo Park,5,6,6,800 Middle Avenue,,CA,94025-9881,US
Menlo Park,6,7,7,500 Arbor Road,,CA,94025,US
Menlo Park,7,8,8,800 Middle Avenue,,CA,94025-9881,US
Redwood City,8,9,9,2510 Middlefield Road,,CA,94063,US
Redwood City,9,10,10,1044 Middlefield Road,,CA,94063,US
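
To make the behaviour of `squeeze()` concrete, here is a minimal sketch on a small hand-made DataFrame (the `frame` data is invented for this example). A single-column DataFrame collapses to a Series, while a DataFrame with several rows and columns comes back unchanged.

```python
import pandas as pd

frame = pd.DataFrame({'city': ['San Mateo', 'Menlo Park'], 'state': ['CA', 'CA']})

one_col = frame[['city']]      # double brackets keep it a DataFrame with one column
as_series = one_col.squeeze()  # one column, so it collapses to a Series
unchanged = frame.squeeze()    # two rows and two columns, so nothing can be squeezed

print(type(one_col))
print(type(as_series))
print(type(unchanged))
```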


## Series Methods

In [186]:
# head and tail give a quick preview of the data
df = pd.read_csv('file1.csv', index_col="postal_code").iloc[:, 0]  # note: `df` now holds a Series
print(df.head(),'\n') # first 5 rows (the default)

print(df.head(2),'\n') # first 2 rows

# tail returns the last n rows; the default is 5
print(df.tail(1)) # last 1 row

postal_code
94063    0
94401    1
94403    2
94401    3
94401    4
Name: Unnamed: 0, dtype: int64 

postal_code
94063    0
94401    1
Name: Unnamed: 0, dtype: int64 

postal_code
94103    20
Name: Unnamed: 0, dtype: int64


In [187]:
# sample pulls n rows (1 by default) at random from the data
df.sample(1)

postal_code
94110    14
Name: Unnamed: 0, dtype: int64
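
Because `sample` is random, the row shown above changes from run to run. A sketch of making it repeatable with the `random_state` parameter follows; the Series `s` and the seed `42` are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series(range(21))

# The same random_state gives the same rows every time
first = s.sample(3, random_state=42)
second = s.sample(3, random_state=42)

print(first)
print(first.equals(second))
```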

In [188]:
# value_counts returns how often each value occurs, most frequent first
file1 = pd.read_csv('file1.csv', index_col="address_1").iloc[:, 4]


print(file1.value_counts())
print()
print(file1)

city
Redwood City           8
San Mateo              5
Menlo Park             3
San Francisco          2
Sunnyvale              1
South San Francisco    1
Belmont                1
Name: count, dtype: int64

address_1
2600 Middlefield Road                Redwood City
24 Second Avenue                        San Mateo
24 Second Avenue                        San Mateo
24 Second Avenue                        San Mateo
24 Second Avenue                        San Mateo
800 Middle Avenue                      Menlo Park
500 Arbor Road                         Menlo Park
800 Middle Avenue                      Menlo Park
2510 Middlefield Road                Redwood City
1044 Middlefield Road                Redwood City
2140 Euclid Avenue.                  Redwood City
1044 Middlefield Road                Redwood City
399 Marine Parkway.                  Redwood City
660 Veterans Blvd.                   Redwood City
1500 Valencia Street                San Francisco
1161 South Bernardo              
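
A short sketch of `value_counts` on a small made-up Series, including the `normalize=True` option that returns proportions instead of raw counts.

```python
import pandas as pd

cities = pd.Series(['Redwood City', 'San Mateo', 'Redwood City', 'Menlo Park'])

counts = cities.value_counts()                # raw frequencies, largest first
shares = cities.value_counts(normalize=True)  # fractions that add up to 1

print(counts)
print(shares)
```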

In [189]:
# sort_values sorts the data by value, in ascending order by default
df.sort_values() # ascending=False for descending order, inplace=True to change the original series

postal_code
94063          0
94401          1
94403          2
94401          3
94401          4
94025-9881     5
94025          6
94025-9881     7
94063          8
94063          9
94061         10
94063         11
94065         12
94063         13
94110         14
94087         15
94080         16
94063         17
94403         18
94002         19
94103         20
Name: Unnamed: 0, dtype: int64

In [190]:
df.sort_index() # sorts the series by its index labels instead of its values

postal_code
94002         19
94025          6
94025-9881     5
94025-9881     7
94061         10
94063         17
94063         13
94063         11
94063          9
94063          0
94063          8
94065         12
94080         16
94087         15
94103         20
94110         14
94401          4
94401          3
94401          1
94403          2
94403         18
Name: Unnamed: 0, dtype: int64
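
The two sorting methods can be compared side by side on a tiny Series (the values and labels below are invented for this sketch). Both return a new Series and leave the original untouched unless `inplace=True` is passed.

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=['b', 'c', 'a'])

desc = s.sort_values(ascending=False)  # order by value, largest first
by_label = s.sort_index()              # order by index label

print(desc)
print(by_label)
print(s)  # the original order is unchanged
```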

## Series Math Methods


In [191]:
# count is like size, but it does not count missing (NaN) values
df.count()

np.int64(21)

In [192]:
# sum adds up all values of the series; prod() multiplies all values of the series
df.sum()

np.int64(210)
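
The difference between `size` and `count` only shows up when values are missing, so here is a sketch with one `NaN` in a made-up Series. It also shows `prod`, which the comment above mentions.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 5.0])

print(s.size)     # 3, every slot is counted
print(s.count())  # 2, the NaN is left out
print(s.sum())    # 7.0, NaN is skipped by default
print(s.prod())   # 10.0, NaN is skipped by default
```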

In [193]:
# mean,median,mode,std,var (of all values of the series)
seriesD = pd.Series([1,2,3,4,5,6,7,8,9,10,1])
print('mean \n',seriesD.mean())
print('median \n',seriesD.median())
print('mode \n',seriesD.mode()) # returns every value tied for the highest frequency
print('standard deviation \n',seriesD.std())  # sample standard deviation by default; pass `ddof=0` for the population value
print('variance \n',seriesD.var()) # sample variance by default; pass `ddof=0` for the population value

mean 
 5.090909090909091
median 
 5.0
mode 
 0    1
dtype: int64
standard deviation 
 3.1766191290283903
variance 
 10.09090909090909
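
A sketch of what the `ddof` parameter changes: with the default `ddof=1` the sum of squared deviations is divided by `n - 1` (sample), and with `ddof=0` it is divided by `n` (population). NumPy's `np.std` uses `ddof=0` by default, so it is used here as a cross-check.

```python
import numpy as np
import pandas as pd

seriesD = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1])

sample_std = seriesD.std()            # divides by n - 1
population_std = seriesD.std(ddof=0)  # divides by n

print(sample_std)
print(population_std)
print(np.std(seriesD.to_numpy()))     # NumPy default matches the population value
```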


In [194]:
# min/max
print(seriesD.min())
print(seriesD.max())

1
10


In [195]:
# describe includes count, mean, std, min, 25%, 50%, 75%, max
# 25% means about a quarter of the values are less than or equal to 2.5, 50% (the median) means about half
# are less than or equal to 5, and 75% means about three quarters are less than or equal to 7.5
seriesD.describe() # count is 11 because all 11 values are non-null (not missing)

count    11.000000
mean      5.090909
std       3.176619
min       1.000000
25%       2.500000
50%       5.000000
75%       7.500000
max      10.000000
dtype: float64
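
The percentile rows of `describe` can also be requested directly with `quantile`. This is a minimal sketch on the same `seriesD` data; the result index holds the requested fractions.

```python
import pandas as pd

seriesD = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1])

# Ask for the same three percentiles that describe() reports
quartiles = seriesD.quantile([0.25, 0.5, 0.75])
print(quartiles)
```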

## Series Indexing

In [196]:
# label and integer indexing
seriesX = pd.Series([7,18,19,100],index=['one','two','three','four'])
# positional access such as seriesX[0] is deprecated on a labelled Series; use `.iloc` for integer-based indexing
print(seriesX['one'])
print(seriesX.iloc[-1])


7
100


In [197]:
# negative positions such as seriesX[-1] do not work when the index labels are integers (not strings), and positional [integer] access is deprecated, so use `.iloc` instead
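
A sketch of why `[]` with a negative number fails on an integer-labelled Series: the number is looked up as a label, not as a position. The Series `nums` and its labels are invented for this example.

```python
import pandas as pd

nums = pd.Series([7, 18, 19], index=[10, 20, 30])

print(nums.loc[10])   # lookup by label
print(nums.iloc[-1])  # lookup by position, negative positions are fine here

missing_label = False
try:
    nums[-1]          # -1 is treated as a label, and no such label exists
except KeyError:
    missing_label = True

print(missing_label)
```

Using `.loc` for labels and `.iloc` for positions avoids this ambiguity.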


In [None]:
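# slicing, .iloc is best practice
# note: the output below reflects seriesX after the edits made in the "Editing Series" section
print(seriesX.iloc[::-1])  # reversed order
# you can also slice by string index labels like this
seriesX['one':'three'] # the end label 'three' is included in the result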


new      999
four       0
three      0
two        0
one        7
dtype: int64


one      7
two      0
three    0
dtype: int64
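
The key difference between the two slicing styles is the end point: a label slice includes the end label, while a positional `.iloc` slice excludes the stop position. A minimal sketch on a freshly created Series:

```python
import pandas as pd

s = pd.Series([7, 18, 19, 100], index=['one', 'two', 'three', 'four'])

by_label = s.loc['one':'three']  # 'three' is included
by_position = s.iloc[0:2]        # position 2 is excluded

print(by_label)
print(by_position)
```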

In [199]:
# fancy indexing 
print(seriesX.iloc[[0,1]])

print('\n',seriesX[['one','four']])

one     7
two    18
dtype: int64

 one       7
four    100
dtype: int64


## Editing Series

In [200]:
# using indexing
seriesX = pd.Series([7,18,19,100],index=['one','two','three','four'])
seriesX.iloc[1] = 7   # positional edit; plain seriesX[1] = 7 is deprecated on a labelled Series
seriesX['four'] = 200 # edit by label


In [201]:
# if the index label does not exist, assigning to it adds a new entry
seriesX['new'] = 999
seriesX

one        7
two        7
three     19
four     200
new      999
dtype: int64

In [212]:
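# editing through slicing
seriesX['two':'four'] = 0  # label slices include the end label, so 'four' is set too
print(seriesX) 

# the positional equivalent would be `seriesX.iloc[1:4] = 0`, because iloc excludes the stop position
seriesX.iloc[1:3] = 111  # changes positions 1 and 2 only ('two' and 'three')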
seriesX

one        7
two        0
three      0
four       0
new      999
dtype: int64


one        7
two      111
three    111
four       0
new      999
dtype: int64

In [214]:
# through fancy indexing
seriesX[['two','new']] = 991
seriesX

one        7
two      991
three    999
four       0
new      991
dtype: int64

In [216]:
# using index label
seriesX['new'] = 9999
seriesX

one         7
two       991
three     999
four        0
new      9999
dtype: int64
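
All of the edits above change `seriesX` in place. If the original values are still needed, one option is to edit a copy instead. The sketch below assumes that approach and uses `.iloc` and `.loc` explicitly; the names `original` and `edited` are chosen for illustration.

```python
import pandas as pd

original = pd.Series([7, 18, 19, 100], index=['one', 'two', 'three', 'four'])

edited = original.copy()   # work on a copy so the original stays intact
edited.iloc[1] = 7         # edit by position
edited.loc['four'] = 200   # edit by label
edited.loc['new'] = 999    # a label that does not exist yet is added

print(edited)
print(original)
```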