## Built-in Datasets
- Several Python libraries ship with datasets that are handy for practice

### pydataset
- Brings datasets that ship with R (the “datasets” package and others) to Python, including well-known ones such as iris, mtcars, and Titanic
- Install : pip install pydataset
- https://pydataset.readthedocs.io/en/latest/

In [1]:
from pydataset import data

In [2]:
# List all datasets
print(data())

        dataset_id                                             title
0    AirPassengers       Monthly Airline Passenger Numbers 1949-1960
1          BJsales                 Sales Data with Leading Indicator
2              BOD                         Biochemical Oxygen Demand
3     Formaldehyde                     Determination of Formaldehyde
4     HairEyeColor         Hair and Eye Color of Statistics Students
..             ...                                               ...
752        VerbAgg                  Verbal Aggression item responses
753           cake                 Breakage Angle of Chocolate Cakes
754           cbpp                 Contagious bovine pleuropneumonia
755    grouseticks  Data on red grouse ticks from Elston et al. 2001
756     sleepstudy       Reaction times in a sleep deprivation study

[757 rows x 2 columns]


In [3]:
dataset_ids = data()['dataset_id'].tolist()
# Join into a comma-separated string
dataset_ids_str = ", ".join(dataset_ids)
print(dataset_ids_str)

AirPassengers, BJsales, BOD, Formaldehyde, HairEyeColor, InsectSprays, JohnsonJohnson, LakeHuron, LifeCycleSavings, Nile, OrchardSprays, PlantGrowth, Puromycin, Titanic, ToothGrowth, UCBAdmissions, UKDriverDeaths, UKgas, USAccDeaths, USArrests, USJudgeRatings, USPersonalExpenditure, VADeaths, WWWusage, WorldPhones, airmiles, airquality, anscombe, attenu, attitude, austres, cars, chickwts, co2, crimtab, discoveries, esoph, euro, faithful, freeny, infert, iris, islands, lh, longley, lynx, morley, mtcars, nhtemp, nottem, npk, occupationalStatus, precip, presidents, pressure, quakes, randu, rivers, rock, sleep, stackloss, sunspot.month, sunspot.year, sunspots, swiss, treering, trees, uspop, volcano, warpbreaks, women, acme, aids, aircondit, aircondit7, amis, aml, bigcity, brambles, breslow, calcium, cane, capability, catsM, cav, cd4, channing, city, claridge, cloth, co.transfer, coal, darwin, dogs, downs.bc, ducks, fir, frets, grav, gravity, hirose, islay, manaus, melanoma, motor, neuro, n

In [4]:
# Wrap the long string so each printed line is at most 100 characters
import textwrap
wrapped = textwrap.fill(dataset_ids_str, width=100)
print(wrapped)

AirPassengers, BJsales, BOD, Formaldehyde, HairEyeColor, InsectSprays, JohnsonJohnson, LakeHuron,
LifeCycleSavings, Nile, OrchardSprays, PlantGrowth, Puromycin, Titanic, ToothGrowth, UCBAdmissions,
UKDriverDeaths, UKgas, USAccDeaths, USArrests, USJudgeRatings, USPersonalExpenditure, VADeaths,
WWWusage, WorldPhones, airmiles, airquality, anscombe, attenu, attitude, austres, cars, chickwts,
co2, crimtab, discoveries, esoph, euro, faithful, freeny, infert, iris, islands, lh, longley, lynx,
morley, mtcars, nhtemp, nottem, npk, occupationalStatus, precip, presidents, pressure, quakes,
randu, rivers, rock, sleep, stackloss, sunspot.month, sunspot.year, sunspots, swiss, treering,
trees, uspop, volcano, warpbreaks, women, acme, aids, aircondit, aircondit7, amis, aml, bigcity,
brambles, breslow, calcium, cane, capability, catsM, cav, cd4, channing, city, claridge, cloth,
co.transfer, coal, darwin, dogs, downs.bc, ducks, fir, frets, grav, gravity, hirose, islay, manaus,
melanoma, motor, neuro, n

In [5]:
# Load iris dataset
iris = data('iris')
print(iris.head())

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa


In [6]:
mtcars = data('mtcars')
mtcars.head()

                    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4          21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2


### sklearn
- Comes with classic ML datasets (Iris, Wine, Digits, etc.). Boston Housing (`load_boston`) was removed in scikit-learn 1.2.
- Install : pip install scikit-learn

In [7]:
from sklearn import datasets

In [8]:
# Load iris as NumPy arrays
iris2 = datasets.load_iris()
print(iris2.data[:5])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [16]:
# Or as pandas DataFrame
import pandas as pd
df2 = pd.DataFrame(iris2.data, columns=iris2.feature_names)
print(df2.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


### seaborn (sns.load_dataset)
- Has several clean, ready-to-use DataFrames for visualization practice.
- Install : pip install seaborn

In [18]:
import seaborn as sns

In [19]:
# List available datasets
print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']


In [20]:
tips = sns.load_dataset('tips')
print(tips.head())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


### statsmodels.datasets
- Many econometrics/statistics datasets.
- Install : pip install statsmodels

In [21]:
import statsmodels.api as sm

In [22]:
# Example: get the longley dataset
dataset = sm.datasets.longley.load_pandas().data
print(dataset.head())

    TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0  60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1  61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2  60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3  61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4  63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0


### TensorFlow Datasets
- Large ML datasets, including images, text, and audio.
- Install : pip install tensorflow tensorflow-datasets

In [24]:
#pip install tensorflow tensorflow-datasets

### Keras datasets
- Handy for deep learning practice (MNIST, CIFAR-10, IMDB, etc.).
- Install : pip install tensorflow

### plotly.data
- Sample datasets for interactive plotting.
- Install : pip install plotly

In [27]:
import plotly.express as px
import plotly.data as pldata  # alias avoids overwriting pydataset's data() imported earlier

In [28]:
gapminder = pldata.gapminder()
print(gapminder.head())

       country continent  year  lifeExp       pop   gdpPercap iso_alpha  \
0  Afghanistan      Asia  1952   28.801   8425333  779.445314       AFG   
1  Afghanistan      Asia  1957   30.332   9240934  820.853030       AFG   
2  Afghanistan      Asia  1962   31.997  10267083  853.100710       AFG   
3  Afghanistan      Asia  1967   34.020  11537966  836.197138       AFG   
4  Afghanistan      Asia  1972   36.088  13079460  739.981106       AFG   

   iso_num  
0        4  
1        4  
2        4  
3        4  
4        4  


## How to Load Data from Online Datasets

### Google Sheets
- Google Sheet URL
    - https://docs.google.com/spreadsheets/d/19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4/edit?gid=764977169#gid=764977169
- Sheet ID (the part of the URL between /d/ and /edit)
    - 19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4
- GID for the orders sheet
    - 764977169
- CSV export link
    - https://docs.google.com/spreadsheets/d/19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4/export?format=csv&gid=764977169
    - 'https://docs.google.com/spreadsheets/d/' + sheetid + '/export?format=csv&gid=' + gid

In [5]:
gs1 = 'https://docs.google.com/spreadsheets/d/'
sheetid = '19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4'
gs2 = '/export?format=csv&gid='
gid = '764977169'
gsurl = gs1 + sheetid + gs2 + gid
print(gsurl)

https://docs.google.com/spreadsheets/d/19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4/export?format=csv&gid=764977169


In [6]:
import pandas as pd
orders_df = pd.read_csv(gsurl)
orders_df.shape

(9994, 21)

In [9]:
# Check whether the sheet is reachable
import requests
requests.head(gsurl, allow_redirects=True, timeout=10)  # a 200 response means it is reachable

<Response [200]>
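The reachability check above can be wrapped in a small helper that returns a boolean instead of raising an exception. This is a minimal sketch that uses the standard library's `urllib` in place of `requests`, so it runs without extra installs; `is_reachable` is a hypothetical helper name, not part of any library.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` ends with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # Covers DNS failures, refused connections, HTTP error statuses,
        # timeouts and malformed URLs.
        return False


# A malformed URL is reported as unreachable rather than raising.
print(is_reachable("not-a-url"))
```

With network access, `is_reachable(gsurl)` could be called before `pd.read_csv(gsurl)` to fail early with a clearer message.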

In [None]:
gs1 = 'https://docs.google.com/spreadsheets/d/'
sheetid = '19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4'
gs2 = '/export?format=csv&gid='
gidO = '764977169' #orders
gidR = '675685971' #Returns
gidP = '1527535056' #People
gsurlO = gs1 + sheetid + gs2 + gidO
gsurlR = gs1 + sheetid + gs2 + gidR
gsurlP = gs1 + sheetid + gs2 + gidP

print(gsurlO, gsurlR, gsurlP, sep='\n')

In [None]:
orders = pd.read_csv(gsurlO, parse_dates=["Order Date","Ship Date"])
returns = pd.read_csv(gsurlR)
people  = pd.read_csv(gsurlP)

In [None]:
# Check for rows and column counts
print('Orders Table - ', orders.shape, '\nReturn Table - ',  returns.shape, '\nPeople Table - ', people.shape)
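The three export URLs built above differ only in their GID, so the string concatenation can be folded into one function. A minimal sketch, reusing the sheet ID and GIDs from the cells above; `gsheet_csv_url` and `tab_gids` are illustrative names introduced here.

```python
BASE = "https://docs.google.com/spreadsheets/d/"


def gsheet_csv_url(sheet_id: str, gid: str) -> str:
    """Build the CSV export URL for one tab of a Google Sheet."""
    return f"{BASE}{sheet_id}/export?format=csv&gid={gid}"


sheet_id = "19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4"
tab_gids = {"orders": "764977169", "returns": "675685971", "people": "1527535056"}

# One URL per tab, keyed by a readable name
tab_urls = {name: gsheet_csv_url(sheet_id, gid) for name, gid in tab_gids.items()}
for name, url in tab_urls.items():
    print(name, url)
```

Each value in `tab_urls` can then be passed to `pd.read_csv` exactly as in the cells above.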

## GitHub Repository for Datasets
- https://github.com/DUanalytics/datasets

### Online Excel 

In [11]:
# pip install xlrd   # engine that pandas needs to read .xls files

In [13]:
import pandas as pd
url = "https://raw.githubusercontent.com/DUanalytics/datasets/master/excel/clustering-vanilla.xls"

In [16]:
# pip install xlrd
df = pd.read_excel(url, engine="xlrd")   # or omit engine if your env auto-detects
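The comments above note that `.xls` files need `xlrd`; other spreadsheet formats use different engines (for example `openpyxl` for `.xlsx`). The sketch below picks an engine name from the file extension. `excel_engine_for` and `ENGINE_BY_SUFFIX` are hypothetical names introduced here, and each engine still has to be installed separately with pip.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Engine names accepted by pandas.read_excel, keyed by file extension
ENGINE_BY_SUFFIX = {
    ".xls": "xlrd",
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xlsb": "pyxlsb",
    ".ods": "odf",
}


def excel_engine_for(url_or_path: str) -> str:
    """Return the read_excel engine name matching the file extension."""
    # urlparse drops any query string, so only the path's suffix is inspected
    suffix = PurePosixPath(urlparse(url_or_path).path).suffix.lower()
    if suffix not in ENGINE_BY_SUFFIX:
        raise ValueError(f"Unrecognised spreadsheet extension: {suffix!r}")
    return ENGINE_BY_SUFFIX[suffix]


url = "https://raw.githubusercontent.com/DUanalytics/datasets/master/excel/clustering-vanilla.xls"
print(excel_engine_for(url))
```

The result could be passed along as `pd.read_excel(url, engine=excel_engine_for(url))`.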

In [15]:
df.head()

   Offer #  Campaign            Varietal  Minimum Qty (kg)  Discount (%)       Origin  Past Peak
0        1   January              Malbec                72            56       France      False
1        2   January          Pinot Noir                72            17       France      False
2        3  February           Espumante               144            32       Oregon       True
3        4  February           Champagne                72            48       France       True
4        5  February  Cabernet Sauvignon               144            44  New Zealand       True


### Online CSV
- https://raw.githubusercontent.com/DUanalytics/datasets/refs/heads/master/csv/denco.csv

In [17]:
urlcsv ='https://raw.githubusercontent.com/DUanalytics/datasets/refs/heads/master/csv/denco.csv'
dfcsv = pd.read_csv(urlcsv)

In [18]:
dfcsv.head()

         custname   region    partnum   revenue      cost     margin
0      3M COMPANY  01-East  727032005   24097.5  19851.82    4245.68
1  4-STATE SUPPLY  01-East  735602000  156200.0  52381.38  103818.62
2  4-STATE SUPPLY  01-East  777143000   34927.2  15382.08   19545.12
3  4-STATE SUPPLY  01-East  777142000   21989.4   12562.5     9426.9
4  4-STATE SUPPLY  01-East  735750000   12487.0   3686.91    8800.09


## Kaggle
- https://www.kaggle.com/datasets
- https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data

In [21]:
#pip install kagglehub   #install this library

In [25]:
import kagglehub
import os
# Download latest version
path = kagglehub.dataset_download("rohitgrewal/airlines-flights-data")

In [23]:
print("Path to dataset files:", path)
# Local folder where kagglehub cached the downloaded files

Path to dataset files: /Users/du/.cache/kagglehub/datasets/rohitgrewal/airlines-flights-data/versions/1


In [28]:
# List files in the dataset folder
print(os.listdir(path))
# note this file name

['airlines_flights_data.csv']


In [30]:
# Build the full path from the file name listed above
csv_file = os.path.join(path, "airlines_flights_data.csv")
print(csv_file)

/Users/du/.cache/kagglehub/datasets/rohitgrewal/airlines-flights-data/versions/1/airlines_flights_data.csv
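Instead of copying the file name by hand, the downloaded folder can be searched for CSV files. A minimal sketch using only the standard library; `find_csv_files` is a hypothetical helper, and the demo runs on a temporary folder with a placeholder file so that it works without a Kaggle download.

```python
import tempfile
from pathlib import Path


def find_csv_files(folder):
    """Return every .csv file under `folder` (including subfolders), sorted by path."""
    return sorted(Path(folder).rglob("*.csv"))


# Demo on a throwaway folder; with kagglehub, pass the `path`
# returned by kagglehub.dataset_download(...) instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "airlines_flights_data.csv").write_text("index,price\n0,5953\n")
    (Path(tmp) / "readme.txt").write_text("not a csv\n")
    found = [p.name for p in find_csv_files(tmp)]

print(found)
```

When a dataset contains a single CSV, `find_csv_files(path)[0]` can be handed to `pd.read_csv` directly.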


In [32]:
# Load into Pandas
df = pd.read_csv(csv_file)
df.head()

   index   airline   flight  source_city  departure_time  stops   arrival_time  destination_city    class  duration  days_left  price
0      0  SpiceJet  SG-8709        Delhi         Evening   zero          Night            Mumbai  Economy      2.17          1   5953
1      1  SpiceJet  SG-8157        Delhi   Early_Morning   zero        Morning            Mumbai  Economy      2.33          1   5953
2      2   AirAsia   I5-764        Delhi   Early_Morning   zero  Early_Morning            Mumbai  Economy      2.17          1   5956
3      3   Vistara   UK-995        Delhi         Morning   zero      Afternoon            Mumbai  Economy      2.25          1   5955
4      4   Vistara   UK-963        Delhi         Morning   zero        Morning            Mumbai  Economy      2.33          1   5955


In [33]:
df.shape

(300153, 12)

## Analysis of this dataset
- Questions
    - Q.1. What are the airlines in the dataset, and how often does each appear?
    - Q.2. Show bar graphs representing the departure time and arrival time.
    - Q.3. Show bar graphs representing the source city and destination city.
    - Q.4. Does price vary with airline?
    - Q.5. Does ticket price change based on the departure time and arrival time?
    - Q.6. How does the price change with the source and destination?
    - Q.7. How is the price affected when tickets are bought just 1 or 2 days before departure?
    - Q.8. How does the ticket price vary between Economy and Business class?
    - Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?
- Features of the Columns
    - 1) Airline: The name of the airline company. It is a categorical feature with 6 different airlines.
    - 2) Flight: The plane's flight code. It is a categorical feature.
    - 3) Source City: City from which the flight takes off. It is a categorical feature with 6 unique cities.
    - 4) Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.
    - 5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
    - 6) Arrival Time: A derived categorical feature created by grouping time intervals into bins. It stores the arrival time and has 6 distinct time labels.
    - 7) Destination City: City where the flight will land. It is a categorical feature with 6 unique cities.
    - 8) Class: A categorical feature for the seat class; it has two distinct values: Business and Economy.
    - 9) Duration: A continuous feature giving the total travel time between cities, in hours.
    - 10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.
    - 11) Price: The target variable; it stores the ticket price.
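A few of the questions above (Q.1, Q.8, and Q.9) map directly onto pandas one-liners. The sketch below runs them on a tiny made-up sample that only borrows the column names from the Kaggle file; the rows and prices are illustrative and are not results from the real dataset. On the real data, replace `sample` with the DataFrame loaded from the CSV.

```python
import pandas as pd

# Made-up rows using the dataset's column names; values are illustrative only.
sample = pd.DataFrame(
    {
        "airline": ["Vistara", "Vistara", "SpiceJet", "Vistara"],
        "source_city": ["Delhi", "Delhi", "Delhi", "Mumbai"],
        "destination_city": ["Hyderabad", "Hyderabad", "Mumbai", "Delhi"],
        "class": ["Business", "Business", "Economy", "Economy"],
        "price": [40000, 50000, 6000, 7000],
    }
)

# Q.1: airlines and how often each appears
airline_counts = sample["airline"].value_counts()

# Q.8: average price per seat class
class_avg = sample.groupby("class")["price"].mean()

# Q.9: average price for Vistara, Delhi to Hyderabad, Business class
mask = (
    (sample["airline"] == "Vistara")
    & (sample["source_city"] == "Delhi")
    & (sample["destination_city"] == "Hyderabad")
    & (sample["class"] == "Business")
)
vistara_avg = sample.loc[mask, "price"].mean()

print(airline_counts)
print(class_avg)
print(vistara_avg)
```

The bar-graph questions (Q.2 and Q.3) could start from the same `value_counts()` result, for example by calling `.plot(kind="bar")` on it.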

## Data.Gov
- https://data.gov/
- https://catalog.data.gov/dataset/electric-vehicle-population-data

In [34]:
url1 = 'https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD'

In [35]:
df = pd.read_csv(url1)

In [36]:
df.head()

Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,5YJSA1E65N,Yakima,Granger,WA,98932.0,2022,TESLA,MODEL S,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,0.0,15.0,187279214,POINT (-120.1871 46.33949),PACIFICORP,53077000000.0
1,KNDC3DLC5N,Yakima,Yakima,WA,98902.0,2022,KIA,EV6,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0.0,0.0,15.0,210098241,POINT (-120.52041 46.59751),PACIFICORP,53077000000.0
2,5YJYGDEEXL,Snohomish,Everett,WA,98208.0,2020,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,291.0,0.0,44.0,121781950,POINT (-122.18637 47.89251),PUGET SOUND ENERGY INC,53061040000.0
3,3C3CFFGE1G,Yakima,Yakima,WA,98908.0,2016,FIAT,500,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,84.0,0.0,14.0,180778377,POINT (-120.60199 46.59817),PACIFICORP,53077000000.0
4,KNDCC3LD5K,Kitsap,Bremerton,WA,98312.0,2019,KIA,NIRO,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,26.0,0.0,26.0,2581225,POINT (-122.65223 47.57192),PUGET SOUND ENERGY INC,53035080000.0


In [37]:
df.shape

(250659, 17)

## Now we have enough sources of data
- We can also load data from files on the local machine