<div style="text-align:center;font-size:22pt; font-weight:bold;color:white;border:solid black 1.5pt;background-color:#1e7263;">
    Reading Comma-Separated Values (CSV) Data
</div>

In [1]:
# ============================================================
#                                                            =
#             Title: Reading CSV Data with Polars            =
#             ---------------------------------              =
#                                                            =
#             Author: Dr. Saad Laouadi                       =
#                                                            =
#             Copyright: Dr. Saad Laouadi                    =
# ============================================================
#                                                            =
#                       LICENSE                              =
#             ----------------------                         =
#                                                            =
#             This material is intended for educational      =
#             purposes only and may not be used directly in  =
#             courses, video recordings, or similar          =
#             without prior consent from the author.         =
#             When using or referencing this material,       =
#             proper credit must be attributed to the        =
#             author.                                        =
# ============================================================

In [2]:
# Environment Setup
import sys
sys.path.append('../../scripts/')  

# import the working libraries
from importlibs import *

******************************************
          The imported libs are:          
******************************************
polars version is :     0.20.2
pandas version is :      2.1.4
numpy version is  :     1.26.2
pyarrow version is:     14.0.2
******************************************
The imported builtin modules are:
['os', 'sys', 'pathlib', 'time', 'shutil', 're']
**************************************************************
The python executable path is:
 /usr/local/Caskroom/mambaforge/base/envs/plenv/bin/python3.12
**************************************************************
Important Reminder:
Before proceeding, please ensure that you have activated the appropriate virtual environment for this project.
This step is crucial to maintain consistent dependencies and project settings.


### The `pl.read_csv()` Function

The `read_csv` function in the Polars library reads a Comma-Separated Values (CSV) file into a DataFrame. Its arguments control how the file is parsed: the delimiter, the header, the columns to load, their data types, and more. The main arguments are:

1. **source:**
    - Type: str, Path, or file-like object
    - Description: The file path or URL of the CSV file, or any object with a read() method (such as a file handle or `BytesIO`).
    - Example: "data.csv" or open("data.csv", "rb")
2. **separator:**
    - Type: str, default is ','
    - Description: The single-character delimiter that separates values. The default is a comma. Older Polars releases called this argument `sep`.
```python
# Tab-separated values
pl.read_csv("data.tsv", separator="\t")
```

3. **has_header:**
    - Type: bool, default is True
    - Description: Indicates whether the first row of the CSV file contains column headers. If False, the first row is read as data and column names are generated automatically (`column_1`, `column_2`, ...).

```python
pl.read_csv("data.csv", has_header=False)
```

4. **columns:**
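    - Type: Optional[List[str]] or Optional[List[int]]
    - Description: The columns to select from the CSV, given either as a list of column names or as a list of zero-based column indices. If not specified, all columns are read.

```python
pl.read_csv("data.csv", columns=["col1", "col2"])
```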

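5. **dtypes:**
    - Type: Optional[Dict[str, DataType]] or a sequence of DataTypes
    - Description: A dictionary mapping column names to Polars data types, or a sequence of data types applied by column position. It overrides the types that Polars would otherwise infer. The argument is named `dtypes` in Polars 0.20.2 (the version used here); newer releases rename it to `schema_overrides`.

```python
pl.read_csv("data.csv", dtypes={"col1": pl.Int32, "col2": pl.Float64})
```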

6. **null_values:**
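    - Type: Optional[str, List[str], Dict[str, str]]
    - Description: Values to interpret as null. It can be a single string, a list of strings (both applied to every column), or a dictionary mapping column names to the null marker used in that column.

```python
pl.read_csv("data.csv", null_values="NA")
```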

7. **skip_rows:**
    - Type: int, default is 0
    - Description: Number of lines to skip at the start of the file. The header, if any, is read from the first line after the skipped ones.

```python
pl.read_csv("data.csv", skip_rows=1)
```

8. **n_rows:**
    - Type: Optional[int]
    - Description: Stop reading after this many rows. Useful for previewing a large file without loading all of it.

```python
pl.read_csv("data.csv", n_rows=100)
```

9. **encoding:**
    - Type: str, default is 'utf8'
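    - Description: The character encoding of the file. `'utf8'` and `'utf8-lossy'` are handled natively; any other encoding is first decoded in Python, which is slower.

```python
pl.read_csv("data.csv", encoding="latin1")
```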

10. **low_memory:**
    - Type: bool, default is False
    - Description: Reduces memory usage while parsing, at the cost of performance.

```python
pl.read_csv("data.csv", low_memory=True)
```

These arguments cover the situations that come up most often with CSV data: non-standard delimiters, missing values, and columns that need specific data types. The defaults of `read_csv` work well for standard CSV files, so these arguments only need adjusting for irregular files or specific requirements.

In [2]:
# print(help(pl.DataFrame))

In [3]:
pl.DataFrame

polars.dataframe.frame.DataFrame

In [4]:
df = pl.DataFrame()

In [5]:
for meth in dir(df):
    if not meth.startswith('_'):
        print(meth)

apply
approx_n_unique
bottom_k
cast
clear
clone
columns
corr
describe
drop
drop_in_place
drop_nulls
dtypes
equals
estimated_size
explode
extend
fill_nan
fill_null
filter
find_idx_by_name
flags
fold
frame_equal
gather_every
get_column
get_column_index
get_columns
glimpse
group_by
group_by_dynamic
group_by_rolling
groupby
groupby_dynamic
groupby_rolling
hash_rows
head
height
hstack
insert_at_idx
insert_column
interpolate
is_duplicated
is_empty
is_unique
item
iter_columns
iter_rows
iter_slices
join
join_asof
lazy
limit
map_rows
max
max_horizontal
mean
mean_horizontal
median
melt
merge_sorted
min
min_horizontal
n_chunks
n_unique
null_count
partition_by
pipe
pivot
product
quantile
rechunk
rename
replace
replace_at_idx
replace_column
reverse
rolling
row
rows
rows_by_key
sample
schema
select
select_seq
set_sorted
shape
shift
shift_and_fill
shrink_to_fit
slice
sort
std
sum
sum_horizontal
tail
take_every
to_arrow
to_dict
to_dicts
to_dummies
to_init_repr
to_numpy
to_pandas
to_series
to_struct
to

In [6]:
# List the public names exposed by the polars namespace
for meth in dir(pl):
    if not meth.startswith('_'):
        print(meth)

Array
ArrowError
Binary
Boolean
Categorical
ColumnNotFoundError
ComputeError
Config
DATETIME_DTYPES
DURATION_DTYPES
DataFrame
DataType
Date
Datetime
Decimal
DuplicateError
Duration
Enum
Expr
FLOAT_DTYPES
Field
Float32
Float64
INTEGER_DTYPES
Int16
Int32
Int64
Int8
InvalidOperationError
LazyFrame
List
NESTED_DTYPES
NUMERIC_DTYPES
NoDataError
Null
Object
OutOfBoundsError
PolarsDataType
PolarsPanicError
SQLContext
SchemaError
SchemaFieldNotFoundError
Series
ShapeError
StringCache
Struct
StructFieldNotFoundError
TEMPORAL_DTYPES
Time
UInt16
UInt32
UInt64
UInt8
Unknown
Utf8
align_frames
all
all_horizontal
any
any_horizontal
api
apply
approx_n_unique
arange
arctan2
arctan2d
arg_sort_by
arg_where
build_info
coalesce
col
collect_all
collect_all_async
concat
concat_list
concat_str
config
contextlib
convert
corr
count
cov
cum_fold
cum_reduce
cum_sum
cum_sum_horizontal
cumfold
cumreduce
cumsum
cumsum_horizontal
dataframe
datatypes
date
date_range
date_ranges
datetime
datetime_range
datetime_ranges


In [7]:
?pl.polars

Type:        module
String form: <module 'polars.polars' from '/usr/local/Caskroom/mambaforge/base/envs/plenv/lib/python3.12/site-packages/polars/polars.abi3.so'>
File:        /usr/local/Caskroom/mambaforge/base/envs/plenv/lib/python3.12/site-packages/polars/polars.abi3.so

In [8]:
from pathlib import Path

In [9]:
data = Path("../Data/").resolve()
print(data)

/Users/daas/TutoringAndTeaching/DataScienceWithPythonPath/03. Polars Path/Polars Path/Data


In [10]:
temperature_path = data.joinpath("city_temperature.csv")

In [11]:
print(temperature_path)

/Users/daas/TutoringAndTeaching/DataScienceWithPythonPath/03. Polars Path/Polars Path/Data/city_temperature.csv


In [12]:
import time 

In [13]:
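# Time how long Polars takes to read the full CSV file.
# perf_counter() is a monotonic clock intended for measuring elapsed time.
start = time.perf_counter()
city_temp = pl.read_csv(temperature_path)
end = time.perf_counter()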

In [14]:
print(end - start)

0.4049038887023926


In [15]:
city_temp.schema

OrderedDict([('Region', Utf8),
             ('Country', Utf8),
             ('State', Utf8),
             ('City', Utf8),
             ('Month', Int64),
             ('Day', Int64),
             ('Year', Int64),
             ('AvgTemperature', Float64)])

In [16]:
city_temp.columns

['Region',
 'Country',
 'State',
 'City',
 'Month',
 'Day',
 'Year',
 'AvgTemperature']

In [17]:
city_temp.to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906327 entries, 0 to 2906326
Data columns (total 8 columns):
 #   Column          Dtype  
---  ------          -----  
 0   Region          object 
 1   Country         object 
 2   State           object 
 3   City            object 
 4   Month           int64  
 5   Day             int64  
 6   Year            int64  
 7   AvgTemperature  float64
dtypes: float64(1), int64(3), object(4)
memory usage: 177.4+ MB


In [20]:
print(city_temp.head())

shape: (5, 8)
┌────────┬─────────┬───────┬─────────┬───────┬─────┬──────┬────────────────┐
│ Region ┆ Country ┆ State ┆ City    ┆ Month ┆ Day ┆ Year ┆ AvgTemperature │
│ ---    ┆ ---     ┆ ---   ┆ ---     ┆ ---   ┆ --- ┆ ---  ┆ ---            │
│ str    ┆ str     ┆ str   ┆ str     ┆ i64   ┆ i64 ┆ i64  ┆ f64            │
╞════════╪═════════╪═══════╪═════════╪═══════╪═════╪══════╪════════════════╡
│ Africa ┆ Algeria ┆ null  ┆ Algiers ┆ 1     ┆ 1   ┆ 1995 ┆ 64.2           │
│ Africa ┆ Algeria ┆ null  ┆ Algiers ┆ 1     ┆ 2   ┆ 1995 ┆ 49.4           │
│ Africa ┆ Algeria ┆ null  ┆ Algiers ┆ 1     ┆ 3   ┆ 1995 ┆ 48.8           │
│ Africa ┆ Algeria ┆ null  ┆ Algiers ┆ 1     ┆ 4   ┆ 1995 ┆ 46.4           │
│ Africa ┆ Algeria ┆ null  ┆ Algiers ┆ 1     ┆ 5   ┆ 1995 ┆ 47.9           │
└────────┴─────────┴───────┴─────────┴───────┴─────┴──────┴────────────────┘


In [21]:
import pandas as pd

In [23]:
# Time the same read with pandas for comparison.
start = time.perf_counter()
df = pd.read_csv(temperature_path, low_memory=False)
end = time.perf_counter()
print(end - start)

3.7848238945007324


In [24]:
city_temp.select(['Country', 'AvgTemperature'])

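┌─────────┬────────────────┐
│ Country ┆ AvgTemperature │
│ ---     ┆ ---            │
│ str     ┆ f64            │
╞═════════╪════════════════╡
│ Algeria ┆ 64.2           │
│ Algeria ┆ 49.4           │
│ Algeria ┆ 48.8           │
│ Algeria ┆ 46.4           │
│ Algeria ┆ 47.9           │
│ Algeria ┆ 48.7           │
│ Algeria ┆ 48.9           │
│ Algeria ┆ 49.1           │
│ Algeria ┆ 49.0           │
│ Algeria ┆ 51.9           │
└─────────┴────────────────┘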


In [25]:
# `alias` is a method of Polars expressions, not of DataFrame, so look it up on pl.Expr.
# The earlier lookup, `pl.DataFrame().with_columns().alias?`, did not reach a Polars
# object and returned the help text of IPython's own %alias magic instead.
pl.Expr.alias?

In [None]:
pl.col