[What is Pandas](https://www.youtube.com/live/zCDVUyq8lkw?si=5PU7SsP9Q7HRQxuF&t=557)

***Series and DataFrame are two prominent data structures in pandas***

[Pandas Series explained](https://www.youtube.com/live/zCDVUyq8lkw?si=otnRlAXbsMqv4rUd&t=711) 

In [1]:
import numpy as np
import pandas as pd

# creating a pandas Series; see the video for an explanation -> https://www.youtube.com/live/zCDVUyq8lkw?si=tFNlGmNIANJ6RIeO&t=897

# string type series
labels = ['a', 'b', 'c']
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [2]:
# integer type series
my_list = np.arange(100,301,100)
pd.Series(data=my_list)

0    100
1    200
2    300
dtype: int64

In [3]:
# custom index

languages = ['Python', 'JS', 'C++','C']
pd.Series(data=languages, index=['first', 'second', 'third', 'fourth'])

first     Python
second        JS
third        C++
fourth         C
dtype: object
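
A custom index also changes how values can be looked up. The following is a minimal sketch, reusing the `languages` list from the cell above, that contrasts label-based access (`.loc`) with position-based access (`.iloc`). The variable names `s`, `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

languages = ['Python', 'JS', 'C++', 'C']
s = pd.Series(data=languages, index=['first', 'second', 'third', 'fourth'])

# label-based lookup uses the custom index
by_label = s.loc['third']        # 'C++'

# position-based lookup ignores the labels
by_position = s.iloc[0]          # 'Python'

# slicing by label includes the end label, unlike positional slicing
subset = s.loc['second':'third']

print(by_label, by_position, list(subset))
```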

In [4]:
# give the Series a name with the `name` parameter; it shows up in the output and is available as `.name`

langs = pd.Series(data=languages, index=['first', 'second', 'third', 'fourth'],name='Programming Languages')
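
The cell above produces no output, so here is a small sketch of where the name shows up. It assumes a shortened two-item version of the same Series; `frame` and `renamed` are illustrative names.

```python
import pandas as pd

langs = pd.Series(['Python', 'JS'], index=['first', 'second'],
                  name='Programming Languages')

# the name is stored on the Series
print(langs.name)

# when the Series is turned into a DataFrame, the name becomes the column label
frame = langs.to_frame()
print(list(frame.columns))

# rename with a scalar returns a new Series with a different name;
# the original Series keeps its name
renamed = langs.rename('Langs')
print(renamed.name, '|', langs.name)
```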


[Name your columns, if the data has no column names, with the `names` parameter of `pd.read_csv`](https://youtu.be/a_XrmKlaGTs?si=wbGmZLBu0LoU_CK5&t=727)
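
As a rough illustration of that point, the sketch below reads a header-less CSV from an in-memory string and supplies the column labels through `names`. The sample rows and the labels `language` and `year` are made up for the example.

```python
from io import StringIO

import pandas as pd

# a tiny CSV without a header row (illustrative data)
raw = StringIO("Python,1991\nC,1972\n")

# header=None says the first line is data; names supplies the column labels
df_named = pd.read_csv(raw, header=None, names=['language', 'year'])
print(df_named)
```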

## Series From dict 

In [5]:
programming_languages = {
    'first': 'JS',
    'second': 'Python',
    'third': 'C++',
    'fourth': 'C'
}

pd.Series(programming_languages) # the keys become the index and the values become the data


first         JS
second    Python
third        C++
fourth         C
dtype: object
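
When a dict is combined with an explicit `index`, pandas looks each label up in the dict. The sketch below, which reuses the dict from the cell above, shows that this can reorder or drop entries, and that a label missing from the dict (the made-up `'fifth'`) ends up as `NaN`.

```python
import pandas as pd

programming_languages = {
    'first': 'JS',
    'second': 'Python',
    'third': 'C++',
    'fourth': 'C'
}

# only the requested labels are kept, in the requested order;
# 'fifth' is not a key of the dict, so its value is NaN
picked = pd.Series(programming_languages, index=['second', 'first', 'fifth'])
print(picked)
```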

## Series Attributes

In [6]:
langs 

first     Python
second        JS
third        C++
fourth         C
Name: Programming Languages, dtype: object

In [7]:
# size, number of items in the series

langs.size


4

In [8]:
# dtype

langs.dtype

dtype('O')

In [9]:
# name

langs.name

'Programming Languages'

In [10]:
# is_unique, tells if all values are unique

langs.is_unique

True

In [11]:
# index, for all index values

langs.index

Index(['first', 'second', 'third', 'fourth'], dtype='object')

In [12]:
# values, returns a NumPy array of all values in the series
langs.values

array(['Python', 'JS', 'C++', 'C'], dtype=object)
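
All values in `langs` happen to be unique, so here is a short sketch of the same attributes on a Series with a repeated value. The `cities` data is illustrative. It also uses `to_numpy()`, which the pandas documentation recommends over `.values` for getting a NumPy array.

```python
import pandas as pd

cities = pd.Series(['San Mateo', 'Belmont', 'San Mateo'], name='city')

print(cities.size)        # 3 items in total
print(cities.is_unique)   # False, because 'San Mateo' appears twice
print(cities.nunique())   # 2 distinct values

# to_numpy() returns the data as a NumPy array, like .values
arr = cities.to_numpy()
print(arr)
```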

## Series using `read_csv`

In [20]:
# download a sample CSV and load it with read_csv
import requests
from io import StringIO
import pandas as pd

# Use the raw URL, not the GitHub webpage URL
url = "https://raw.githubusercontent.com/codeforamerica/ohana-api/master/data/sample-csv/addresses.csv"
headers = {"User-Agent": "Mozilla/5.0"}

# Make the request
req = requests.get(url, headers=headers)

# Convert to CSV using StringIO
data = StringIO(req.text)

# Load into pandas DataFrame
df = pd.read_csv(data)

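# Export the CSV to the working directory. to_csv also writes the index by default,
# which is where the "Unnamed: 0" columns in the output come from (pass index=False to avoid them).
df.to_csv('file1.csv')
df = pd.read_csv('file1.csv')  # read_csv returns a DataFrame by default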
df

Unnamed: 0.1,Unnamed: 0,id,location_id,address_1,address_2,city,state_province,postal_code,country
0,0,1,1,2600 Middlefield Road,,Redwood City,CA,94063,US
1,1,2,2,24 Second Avenue,,San Mateo,CA,94401,US
2,2,3,3,24 Second Avenue,,San Mateo,CA,94403,US
3,3,4,4,24 Second Avenue,,San Mateo,CA,94401,US
4,4,5,5,24 Second Avenue,,San Mateo,CA,94401,US
5,5,6,6,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
6,6,7,7,500 Arbor Road,,Menlo Park,CA,94025,US
7,7,8,8,800 Middle Avenue,,Menlo Park,CA,94025-9881,US
8,8,9,9,2510 Middlefield Road,,Redwood City,CA,94063,US
9,9,10,10,1044 Middlefield Road,,Redwood City,CA,94063,US
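
The `Unnamed: 0` columns in the output above appear when a DataFrame is written with its index and then read back. The sketch below reproduces this with a small made-up DataFrame and in-memory buffers instead of `file1.csv`, and shows that `index=False` avoids the extra column.

```python
from io import StringIO

import pandas as pd

df_small = pd.DataFrame({'city': ['Belmont', 'San Mateo'],
                         'postal_code': ['94002', '94401']})

# default: the index is written as an extra, unnamed first column
with_index = StringIO()
df_small.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the real columns are written
without_index = StringIO()
df_small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```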


In [35]:
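# read_csv returns a DataFrame; selecting a single column already gives a Series,
# so .squeeze() is a no-op here (it only matters for a one-column DataFrame).
# The old `squeeze` parameter of read_csv was removed in pandas 2.0; use the .squeeze() method instead.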

print(df['address_1'].squeeze())

print()
print(type(df['address_1'].squeeze()))

print(type(df['address_1'])) # selecting a single column already returns a Series

# second way of getting a Series: set an index column and take the first remaining column
# pd.read_csv('file1.csv', index_col="city").iloc[:, 0] 

0          2600 Middlefield Road
1               24 Second Avenue
2               24 Second Avenue
3               24 Second Avenue
4               24 Second Avenue
5              800 Middle Avenue
6                 500 Arbor Road
7              800 Middle Avenue
8          2510 Middlefield Road
9          1044 Middlefield Road
10           2140 Euclid Avenue.
11         1044 Middlefield Road
12           399 Marine Parkway.
13            660 Veterans Blvd.
14          1500 Valencia Street
15           1161 South Bernardo
16       409 South Spruce Avenue
17              114 Fifth Avenue
18           19 West 39th Avenue
19            123 El Camino Real
20    2013 Avenue of the fellows
Name: address_1, dtype: object

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
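
To make the role of `squeeze` clearer, the sketch below compares the different ways of selecting a column. It uses a small made-up DataFrame rather than `file1.csv`, and the variable names are illustrative.

```python
import pandas as pd

df_small = pd.DataFrame({'address_1': ['24 Second Avenue', '500 Arbor Road'],
                         'city': ['San Mateo', 'Menlo Park']})

col = df_small['address_1']            # single brackets: already a Series
one_col = df_small[['address_1']]      # double brackets: a one-column DataFrame

# squeeze("columns") collapses a one-column DataFrame into a Series
squeezed = one_col.squeeze('columns')

# with more than one column there is nothing to collapse, so the DataFrame is returned
unchanged = df_small.squeeze('columns')

print(type(col).__name__, type(one_col).__name__,
      type(squeezed).__name__, type(unchanged).__name__)
```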


In [None]:
# the benefit of giving a Series a name - https://www.youtube.com/live/zCDVUyq8lkw?si=Y02AbkaecZALGGQv&t=2307
# when a Series comes from a DataFrame column, the column name becomes the name of the Series

In [37]:
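# keep only two columns so that a single column remains after setting the index; squeeze then returns a Series
pd.read_csv('file1.csv', usecols=["city", "postal_code"], index_col="city").squeeze("columns")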

city
Menlo Park             94025-9881
Menlo Park             94025-9881
San Mateo                   94403
San Mateo                   94403
San Mateo                   94401
San Mateo                   94401
San Mateo                   94401
San Francisco               94110
San Francisco               94103
Sunnyvale                   94087
South San Francisco         94080
Redwood City                94065
Redwood City                94063
Redwood City                94063
Redwood City                94063
Redwood City                94063
Redwood City                94063
Redwood City                94063
Redwood City                94061
Menlo Park                  94025
Belmont                     94002
Name: postal_code, dtype: object
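
The same pattern can be tried without any file on disk. The following sketch assumes a tiny in-memory CSV with made-up rows: `usecols` keeps two columns, `index_col` turns one of them into the index, and `squeeze("columns")` converts the remaining one-column DataFrame into a Series named after that column.

```python
from io import StringIO

import pandas as pd

raw = StringIO("id,city,postal_code\n1,Belmont,94002\n2,San Mateo,94401\n")

codes = pd.read_csv(
    raw,
    usecols=['city', 'postal_code'],   # drop every other column
    index_col='city',                  # city becomes the index
    dtype={'postal_code': str},        # keep postal codes as text
).squeeze('columns')

print(codes)
```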

## Series Methods

In [76]:
# head and tail (to get a preview of the data)
# note: despite its name, `df` is a Series here (index: postal_code, values: city)
df = pd.read_csv('file1.csv', usecols=["postal_code", "city"], index_col="postal_code").iloc[:, 0]
print(df.head(),'\n') # first 5 rows (the default)

print(df.head(2),'\n') # first 2 rows

# tail returns the last n rows; n defaults to 5
print(df.tail(1)) # last 1 row

postal_code
94025-9881    Menlo Park
94025-9881    Menlo Park
94403          San Mateo
94403          San Mateo
94401          San Mateo
Name: city, dtype: object 

postal_code
94025-9881    Menlo Park
94025-9881    Menlo Park
Name: city, dtype: object 

postal_code
94002    Belmont
Name: city, dtype: object
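
For a quick check of how `n` behaves, here is a small sketch on a made-up numeric Series. It also shows that a negative `n` is accepted, in which case `head` returns everything except the last rows.

```python
import pandas as pd

s = pd.Series(range(10, 20))    # values 10..19

first_five = s.head()           # n defaults to 5
first_two = s.head(2)
last_one = s.tail(1)
all_but_last_two = s.head(-2)   # negative n: everything except the last 2 rows

print(first_five.tolist(), first_two.tolist(), last_one.tolist())
print(all_but_last_two.tolist())
```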


In [77]:
# sample, pulls n rows (by default 1) at random from the data
df.sample(1)

postal_code
94025-9881    Menlo Park
Name: city, dtype: object
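
Because `sample` is random, repeated runs usually return different rows. A minimal sketch, using a made-up Series, of how `random_state` makes the draw repeatable and how `frac` samples a fraction of the rows instead of a fixed number:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'])

# the same random_state gives the same rows on every call
pick1 = s.sample(n=2, random_state=42)
pick2 = s.sample(n=2, random_state=42)
print(pick1.equals(pick2))

# frac=0.4 of 5 rows draws 2 rows (sampling is without replacement by default)
part = s.sample(frac=0.4, random_state=0)
print(len(part))
```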

In [None]:
# value_counts, counts how often each value occurs (most frequent first)
df.value_counts()

city
Redwood City           8
San Mateo              5
Menlo Park             3
San Francisco          2
Sunnyvale              1
South San Francisco    1
Belmont                1
Name: count, dtype: int64
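
`value_counts` can also report relative frequencies. The sketch below uses a short made-up list of cities and the `normalize=True` option, which returns shares that add up to 1 instead of raw counts.

```python
import pandas as pd

cities = pd.Series(['Redwood City', 'San Mateo', 'Redwood City', 'Belmont'])

counts = cities.value_counts()                  # absolute counts, most frequent first
shares = cities.value_counts(normalize=True)    # fractions of the total

print(counts)
print(shares)
```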

In [84]:
# sort_values, sorts the data by value; ascending order by default
df.sort_values().head(1) # pass ascending=False for descending order


postal_code
94002    Belmont
Name: city, dtype: object
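
A short sketch of both sort directions on a made-up Series with the same shape as above (postal codes as index, cities as values). It also shows `sort_index`, which orders by the index labels rather than the values. Both methods return a new Series and leave the original untouched.

```python
import pandas as pd

s = pd.Series(['San Mateo', 'Belmont', 'Menlo Park'],
              index=['94401', '94002', '94025'], name='city')

asc = s.sort_values()                   # alphabetical order of the city names
desc = s.sort_values(ascending=False)   # reverse alphabetical order
by_index = s.sort_index()               # ordered by postal code label instead

print(asc.tolist())
print(desc.tolist())
print(by_index.index.tolist())
```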