## Built-in Datasets
- Several Python libraries ship with sample datasets that are handy for practice.

### pydataset
- Provides datasets from R packages (including the R “datasets” package) as pandas DataFrames; contains famous datasets such as iris, mtcars, and Titanic.
- Install : pip install pydataset
- https://pydataset.readthedocs.io/en/latest/

In [40]:
#pip install pydataset   # run once to install, then keep this line commented out

In [41]:
from pydataset import data

In [42]:
# List all datasets
print(data())

        dataset_id                                             title
0    AirPassengers       Monthly Airline Passenger Numbers 1949-1960
1          BJsales                 Sales Data with Leading Indicator
2              BOD                         Biochemical Oxygen Demand
3     Formaldehyde                     Determination of Formaldehyde
4     HairEyeColor         Hair and Eye Color of Statistics Students
..             ...                                               ...
752        VerbAgg                  Verbal Aggression item responses
753           cake                 Breakage Angle of Chocolate Cakes
754           cbpp                 Contagious bovine pleuropneumonia
755    grouseticks  Data on red grouse ticks from Elston et al. 2001
756     sleepstudy       Reaction times in a sleep deprivation study

[757 rows x 2 columns]


In [43]:
dataset_ids = data()['dataset_id'].tolist()
# Join into a comma-separated string
dataset_ids_str = ", ".join(dataset_ids)
print(dataset_ids_str)

AirPassengers, BJsales, BOD, Formaldehyde, HairEyeColor, InsectSprays, JohnsonJohnson, LakeHuron, LifeCycleSavings, Nile, OrchardSprays, PlantGrowth, Puromycin, Titanic, ToothGrowth, UCBAdmissions, UKDriverDeaths, UKgas, USAccDeaths, USArrests, USJudgeRatings, USPersonalExpenditure, VADeaths, WWWusage, WorldPhones, airmiles, airquality, anscombe, attenu, attitude, austres, cars, chickwts, co2, crimtab, discoveries, esoph, euro, faithful, freeny, infert, iris, islands, lh, longley, lynx, morley, mtcars, nhtemp, nottem, npk, occupationalStatus, precip, presidents, pressure, quakes, randu, rivers, rock, sleep, stackloss, sunspot.month, sunspot.year, sunspots, swiss, treering, trees, uspop, volcano, warpbreaks, women, acme, aids, aircondit, aircondit7, amis, aml, bigcity, brambles, breslow, calcium, cane, capability, catsM, cav, cd4, channing, city, claridge, cloth, co.transfer, coal, darwin, dogs, downs.bc, ducks, fir, frets, grav, gravity, hirose, islay, manaus, melanoma, motor, neuro, n

In [None]:
# another way to wrap lines
import textwrap
wrapped = textwrap.fill(dataset_ids_str, width=100)
print(wrapped)

In [44]:
# Load iris dataset
iris = data('iris')
print(iris.head())

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa


In [45]:
mtcars = data('mtcars')
mtcars.head()

                    mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4          21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2


### sklearn
- Comes with classic ML datasets (Iris, Wine, Digits, Breast Cancer, California Housing, etc.). The Boston Housing dataset was removed in scikit-learn 1.2.
- install : pip install scikit-learn

In [None]:
from sklearn import datasets

In [None]:
# Load iris as NumPy arrays
iris2 = datasets.load_iris()
print(iris2.data[:5])

In [None]:
# Or as pandas DataFrame
import pandas as pd
df2 = pd.DataFrame(iris2.data, columns=iris2.feature_names)
print(df2.head())

### seaborn (sns.load_dataset)
- Has several clean, ready-to-use DataFrames for visualization practice.
- Install : pip install seaborn

In [None]:
import seaborn as sns

In [None]:
# List available datasets
print(sns.get_dataset_names())

In [None]:
tips = sns.load_dataset('tips')
print(tips.head())

### statsmodels.datasets
- Many econometrics/statistics datasets.
- Install : pip install statsmodels

In [None]:
import statsmodels.api as sm

In [None]:
# Example: get the longley dataset
dataset = sm.datasets.longley.load_pandas().data
print(dataset.head())

### TensorFlow Datasets
- Large ML datasets, including images, text, and audio.
- Install : pip install tensorflow tensorflow-datasets

In [None]:
#pip install tensorflow tensorflow-datasets

### Keras datasets
- Handy for deep learning practice (MNIST, CIFAR-10, IMDB, etc.).
- Install : pip install tensorflow

### plotly.data
- Sample datasets for interactive plotting.
- Install : pip install plotly

In [None]:
import plotly.express as px
import plotly.data as plotly_data   # alias avoids shadowing pydataset's data() imported earlier

In [None]:
gapminder = plotly_data.gapminder()
print(gapminder.head())

## How to Load Data from Online Datasets

### Google Sheets
- Google Sheet URL
    - https://docs.google.com/spreadsheets/d/19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4/edit?gid=764977169#gid=764977169
- Sheet ID
    - 19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4
- GID for the Orders sheet
    - 764977169
- CSV export link
    - https://docs.google.com/spreadsheets/d/19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4/export?format=csv&gid=764977169
    - Pattern: 'https://docs.google.com/spreadsheets/d/' + sheetid + '/export?format=csv&gid=' + gid

In [None]:
gs1 = 'https://docs.google.com/spreadsheets/d/'
sheetid = '19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4'
gs2 = '/export?format=csv&gid='
gid = '764977169'
gsurl = gs1 + sheetid + gs2 + gid
print(gsurl)

In [None]:
import pandas as pd
orders_df = pd.read_csv(gsurl)
orders_df.shape

In [None]:
# Check whether the sheet is reachable
import requests
response = requests.head(gsurl, allow_redirects=True, timeout=10)
print(response.status_code)   # 200 means the sheet is reachable

In [None]:
gs1 = 'https://docs.google.com/spreadsheets/d/'
sheetid = '19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4'
gs2 = '/export?format=csv&gid='
gidO = '764977169' #orders
gidR = '675685971' #Returns
gidP = '1527535056' #People
gsurlO = gs1 + sheetid + gs2 + gidO
gsurlR = gs1 + sheetid + gs2 + gidR
gsurlP = gs1 + sheetid + gs2 + gidP

print(gsurlO, gsurlR, gsurlP, sep='\n')

In [None]:
orders = pd.read_csv(gsurlO, parse_dates=["Order Date","Ship Date"])
returns = pd.read_csv(gsurlR)
people  = pd.read_csv(gsurlP)

In [None]:
# Check for rows and column counts
print('Orders Table - ', orders.shape, '\nReturn Table - ',  returns.shape, '\nPeople Table - ', people.shape)
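
The three export URLs above follow one pattern, so they can also be generated by a small helper. This is only a sketch: the function name `sheet_csv_url` and the `tabs` dictionary are illustrative names chosen here, not part of pandas or of any Google API.

```python
from urllib.parse import urlencode

BASE = "https://docs.google.com/spreadsheets/d/"

def sheet_csv_url(sheet_id: str, gid: str) -> str:
    """Build the CSV export URL for one tab (gid) of a Google Sheet."""
    query = urlencode({"format": "csv", "gid": gid})
    return f"{BASE}{sheet_id}/export?{query}"

sheet_id = "19ReQlRfDQHcV1OFUnmVkiFY_1IrJeOR0g1RmrjfjMD4"
# Tab name -> gid, taken from the cells above
tabs = {"orders": "764977169", "returns": "675685971", "people": "1527535056"}

urls = {name: sheet_csv_url(sheet_id, gid) for name, gid in tabs.items()}
for name, url in urls.items():
    print(name, url)
```

Each value in `urls` can then be passed to `pd.read_csv`, exactly as with `gsurlO`, `gsurlR`, and `gsurlP`.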

### GitHub Repository for Datasets
- https://github.com/DUanalytics/datasets

### Online Excel 

In [None]:
#pip install xlrd   #install this engine
# For .xls files, pandas needs xlrd

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/DUanalytics/datasets/master/excel/clustering-vanilla.xls"

In [None]:
# pip install xlrd
df = pd.read_excel(url, engine="xlrd")   # or omit engine if your env auto-detects

In [None]:
df.head()
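
pandas needs a different reader engine depending on the Excel file type: `xlrd` handles legacy `.xls` files, while `.xlsx` files are read with `openpyxl`. As a small sketch, the helper below picks the engine from the URL's extension. The name `excel_engine_for` is made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def excel_engine_for(url: str) -> str:
    """Return the pandas Excel engine name that matches the file extension."""
    # Look only at the path part of the URL, ignoring any query string
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    engines = {".xls": "xlrd", ".xlsx": "openpyxl"}
    if suffix not in engines:
        raise ValueError(f"Not an Excel file extension: {suffix!r}")
    return engines[suffix]

url = "https://raw.githubusercontent.com/DUanalytics/datasets/master/excel/clustering-vanilla.xls"
engine = excel_engine_for(url)
print(engine)
```

The result can be passed straight through, for example `pd.read_excel(url, engine=engine)`, provided the matching package is installed.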

### Online CSV
- https://raw.githubusercontent.com/DUanalytics/datasets/refs/heads/master/csv/denco.csv

In [None]:
urlcsv = 'https://raw.githubusercontent.com/DUanalytics/datasets/refs/heads/master/csv/denco.csv'
dfcsv = pd.read_csv(urlcsv)

In [None]:
dfcsv.head()

## Kaggle
- https://www.kaggle.com/datasets
- https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data

In [None]:
#pip install kagglehub   #install this library

In [None]:
import kagglehub
import os
# Download latest version
path = kagglehub.dataset_download("rohitgrewal/airlines-flights-data")

In [None]:
print("Path to dataset files:", path)
# This is the path in the local file system where the dataset was downloaded

In [None]:
# List files in the dataset folder
print(os.listdir(path))
# note this file name

In [None]:
# Use the actual file name listed above (here it is 'airlines_flights_data.csv')
csv_file = os.path.join(path, "airlines_flights_data.csv")
print(csv_file)
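
Instead of copying the file name by hand, the CSV files in the download folder can be located programmatically. This is a minimal sketch using only the standard library. `find_csv_files` is a hypothetical helper name, and the demo runs on a temporary folder so that it does not depend on the Kaggle download.

```python
import tempfile
from pathlib import Path

def find_csv_files(folder):
    """Return all .csv files under folder (searched recursively), sorted by path."""
    return sorted(Path(folder).rglob("*.csv"))

# Demo on a throwaway folder; in the notebook, pass the `path` returned by kagglehub
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "airlines_flights_data.csv").write_text("airline,price\n")
    (Path(tmp) / "readme.txt").write_text("not a csv\n")
    found = [p.name for p in find_csv_files(tmp)]

print(found)
```

With the real download, `find_csv_files(path)[0]` would give a path that can be handed to `pd.read_csv`, assuming the dataset contains at least one CSV file.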

In [None]:
# Load into Pandas
df = pd.read_csv(csv_file)
df.head()

In [None]:
df.shape

## Analysis of this dataset
- Questions
    - Q.1. What are the airlines in the dataset, and how often does each appear?
    - Q.2. Show bar graphs of the departure time and arrival time.
    - Q.3. Show bar graphs of the source city and destination city.
    - Q.4. Does the price vary across airlines?
    - Q.5. Does the ticket price change with the departure time and arrival time?
    - Q.6. How does the price change with the source and destination?
    - Q.7. How is the price affected when tickets are bought just 1 or 2 days before departure?
    - Q.8. How does the ticket price vary between Economy and Business class?
    - Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?
- Features of the columns
    1. Airline: The name of the airline company. It is a categorical feature with 6 different airlines.
    2. Flight: The plane's flight code. It is a categorical feature.
    3. Source City: The city from which the flight takes off. It is a categorical feature with 6 unique cities.
    4. Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.
    5. Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
    6. Arrival Time: A derived categorical feature created by grouping time intervals into bins. It stores the arrival time and has 6 distinct time labels.
    7. Destination City: The city where the flight lands. It is a categorical feature with 6 unique cities.
    8. Class: A categorical feature for the seat class. It has two distinct values: Business and Economy.
    9. Duration: A continuous feature giving the total travel time between the cities, in hours.
    10. Days Left: A derived feature, calculated by subtracting the booking date from the trip date.
    11. Price: The target variable, which stores the ticket price.
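
As a starting point for the questions above, here is a hedged sketch of Q.1 and Q.9 in pandas. The column names (`airline`, `source_city`, `destination_city`, `class`, `price`) are assumptions about the downloaded file, so check them against `df.columns` first. The four rows are made-up toy values, not data from the Kaggle dataset.

```python
import pandas as pd

# Toy rows invented for illustration; in the notebook, use the df loaded from Kaggle
toy = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Vistara", "Indigo"],
    "source_city": ["Delhi", "Delhi", "Delhi", "Mumbai"],
    "destination_city": ["Hyderabad", "Hyderabad", "Hyderabad", "Delhi"],
    "class": ["Business", "Business", "Economy", "Economy"],
    "price": [40000, 50000, 6000, 4000],
})

# Q.1: airlines and how often each appears
airline_counts = toy["airline"].value_counts()
print(airline_counts)

# Q.9: average Business-class price for Vistara, Delhi to Hyderabad
mask = (
    (toy["airline"] == "Vistara")
    & (toy["source_city"] == "Delhi")
    & (toy["destination_city"] == "Hyderabad")
    & (toy["class"] == "Business")
)
avg_price = toy.loc[mask, "price"].mean()
print(avg_price)
```

The same filter-then-aggregate pattern extends to the other questions, for example `groupby("airline")["price"].mean()` for Q.4.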

## Data.gov
- https://data.gov/
- https://catalog.data.gov/dataset/electric-vehicle-population-data

In [None]:
url1 = 'https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD'

In [None]:
df = pd.read_csv(url1)

In [None]:
df.head()

In [None]:
df.shape

## Now we have enough sources of data
- Data can also be loaded from files on the local laptop.
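
Loading from the local machine uses the same `pd.read_csv` call, with a file path instead of a URL. The sketch below writes a throwaway CSV to a temporary folder first so that it runs anywhere. The file name `sample.csv` and its contents are invented for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp) / "sample.csv"
    local_file.write_text("name,score\nA,10\nB,20\n")
    local_df = pd.read_csv(local_file)   # a local path is read the same way as a URL

print(local_df.shape)
```

In practice, replace `local_file` with the path to a file on your own machine.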