# File Handling

### Python supports a wide range of file formats. Here are some common examples:

#### `Text files:`

Plain text files (.txt).

CSV files (.csv).

JSON files (.json).

XML files (.xml).

HTML files (.html).


#### `Binary files:`

Image files (.jpg, .png, .gif, .bmp, .tiff).

Audio files (.mp3, .wav, .flac).

Video files (.mp4, .avi, .mkv).

PDF files (.pdf).

Excel files (.xls, .xlsx).

Word documents (.doc, .docx).

Compressed files (.zip, .tar, .gz).


#### `Specialised formats:`

SQLite databases (.sqlite, .db).

HDF5 files (.h5).

Pickle files (.pkl).

YAML files (.yaml, .yml).


#### `Python-Specific Files:`

Python source code (.py).

Compiled Python files (.pyc).

Jupyter notebooks (.ipynb).


#### Python is versatile and can handle many types of files. The two most fundamental kinds are text files and binary files.

###### Text files contain human-readable characters and are read and written with the built-in `open` function in text modes such as 'r' for reading and 'w' for writing. Example file extensions: .txt, .csv, and .json.

###### Binary files contain raw bytes and are read and written with the same `open` function, but in binary modes such as 'rb' for reading binary and 'wb' for writing binary. Examples: images (.jpg, .png) and audio files (.mp3, .wav).
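
The difference between the two kinds of mode can be sketched as follows. This is a minimal illustration, not part of the original notebook: the file name `example.txt` is hypothetical and is created in a temporary directory so that the snippet runs anywhere.

```python
import os
import tempfile

# "example.txt" is a hypothetical file created only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Text mode ('w'): we write a str and Python encodes it to bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write("café")

    # Text mode ('r'): Python decodes the bytes and returns a str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode ('rb'): we get the raw bytes exactly as stored on disk.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text), len(as_text))    # a str with 4 characters
print(type(as_bytes), len(as_bytes))  # a bytes object with 5 bytes ('é' takes 2 in UTF-8)
```

The same file yields a `str` in text mode and a `bytes` object in binary mode, which is why images and audio should always be opened with 'rb' or 'wb'.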




# File vs File Object

### File
A file is a document or a piece of information that can be created, modified and stored by a user or an operating system.

In Python, a file can be either text or binary. In this course, we will only work with text files.

Example: Breast_Cancer.csv

### File Object
A file object is a Python object that represents an open file. To use a file in a program, we first open it with the built-in `open` function, which returns a file object. The file object does not hold the file's data itself; it provides methods and properties for working with the file, including reading from it, writing to it, and closing it.

file_path = 'C:/Users/user/Desktop/Breast_Cancer.csv'
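
A path such as the one above is only a string; the file object appears once the path is passed to `open`. The sketch below illustrates this under an assumption: instead of the real dataset, it uses a small stand-in file (`sample.csv`, hypothetical) written to a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for the real CSV path; the name and contents are illustrative.
    file_path = os.path.join(tmp_dir, "sample.csv")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("Age,Status\n68,Alive\n")

    # open() returns a file object: a handle to the file, not its contents.
    file_object = open(file_path, "r", encoding="utf-8")
    mode = file_object.mode                # the mode the file was opened in
    first_line = file_object.readline()    # methods read data on request
    file_object.close()                    # release the file when done
    is_closed = file_object.closed

print(mode, repr(first_line), is_closed)
```

In practice the `with open(...) as f:` form is usually preferable, because it closes the file automatically even if an error occurs.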




# Reading and Parsing

### Reading
Reading is the process of obtaining and retrieving the contents of a file. 
### Parsing

Parsing often entails analysing the structure of the file's content and extracting important data into an organised format, such as lists, dictionaries, or data frames.


Reading a file entails examining its contents. Think of it as opening a book and reading the words on the pages.

Parsing is the process of making sense of the information read from a file. It is similar to identifying the chapters and sections of a book and understanding its structure.

When you parse a file, you interpret its contents in order to extract the information you need.
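
The two steps can be sketched separately. This is a minimal illustration with made-up sample text; `io.StringIO` stands in for an open file, and the naive `split(",")` is an assumption that only holds when no field contains a comma or quotes (the `csv` module handles those cases).

```python
import io

# StringIO stands in for an open text file; the contents are illustrative.
raw_file = io.StringIO("Age,Grade,Status\n68,3,Alive\n50,2,Alive\n")

# Reading: retrieve the contents as one string.
text = raw_file.read()

# Parsing: interpret the structure and build an organised result.
lines = text.splitlines()
header = lines[0].split(",")
records = []
for line in lines[1:]:
    record = dict(zip(header, line.split(",")))
    record["Age"] = int(record["Age"])      # convert numeric fields from str
    record["Grade"] = int(record["Grade"])
    records.append(record)

print(records[0])
```

After reading, `text` is just characters; after parsing, `records` is a list of dictionaries whose fields can be looked up by name.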




# Data Structures

#### Data structures are ways of organising and storing data so that it can be accessed and used easily. The choice of data structure affects how efficiently data is processed and stored: a suitable one can increase speed, reduce resource use, and make code more readable and maintainable.


### Structured Data

Structured data refers to information that is organized according to a fixed layout, which makes finding specific pieces of information easy. This type of data is usually tabular, with rows and columns, as in an Excel spreadsheet or a table in a SQL database.

### Unstructured Data

Unstructured data refers to information that is not organized in a way that makes finding specific pieces of information easy. This type of data can include video, audio, photos, presentations, web pages, text, and more.

### Semi-Structured Data

Semi-structured data is a middle ground between structured and unstructured data. It does not have a rigid structure like structured data, but it still has some organizational properties, such as tags or markers that separate data elements. These markers make the data easier to access and analyze, even though individual records do not all have to contain the same fields.

Examples: 

JSON (JavaScript Object Notation)

XML (eXtensible Markup Language)

HTML (HyperText Markup Language)
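
As a small illustration of semi-structured data, the sketch below parses a JSON string with the standard `json` module. The records are hypothetical and made up for this example; note that the second record has a field the first one lacks.

```python
import json

# Hypothetical records: keys act as markers, but the fields can differ per record.
text = """
[
  {"id": 1, "tags": ["red", "dry"]},
  {"id": 2, "tags": ["white"], "note": "limited release"}
]
"""

items = json.loads(text)  # parse the JSON text into Python lists and dicts

for item in items:
    # .get() supplies a default when a record has no "note" field
    note = item.get("note", "no note")
    print(item["id"], len(item["tags"]), note)
```

Unlike a table, nothing forces every record to have the same columns, so code that reads semi-structured data usually has to allow for missing fields.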

### Let's look at a dataset from Kaggle

In [1]:
file_path = 'C:/Users/user/Desktop/Breast_Cancer.csv'


In [2]:
with open(file_path, 'r') as breast_cancer_data:
    data = breast_cancer_data.read()

In [3]:
print(data)


Age,Race,Marital Status,T Stage ,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive
51,White,Single ,T1,N1,IIA,Moderately differentiated,2,Regional,20,Positive,Positive,18,2,89,Alive
51,White,Married,T1,N1,IIA,Well differentiated,1,Regional,8,Positive,Positive,11,1,54,Alive
40,White,Married,T2,N1,IIB,Moderately differentiated,2,Regional,30,Positive,Positive,9,1,14,Dead
40,White,Divorced,T4,N3,IIIC,Poorly

### Using the csv module


In [4]:
import csv

file_path = 'C:/Users/user/Desktop/Breast_Cancer.csv'

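# newline='' lets the csv module handle line endings itself
with open(file_path, 'r', newline='') as breast_cancer_data:
    csv_reader = csv.reader(breast_cancer_data)

    # Each row is returned as a list of strings
    for row in csv_reader:
        print(row)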


['Age', 'Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'Grade', 'A Stage', 'Tumor Size', 'Estrogen Status', 'Progesterone Status', 'Regional Node Examined', 'Reginol Node Positive', 'Survival Months', 'Status']
['68', 'White', 'Married', 'T1', 'N1', 'IIA', 'Poorly differentiated', '3', 'Regional', '4', 'Positive', 'Positive', '24', '1', '60', 'Alive']
['50', 'White', 'Married', 'T2', 'N2', 'IIIA', 'Moderately differentiated', '2', 'Regional', '35', 'Positive', 'Positive', '14', '5', '62', 'Alive']
['58', 'White', 'Divorced', 'T3', 'N3', 'IIIC', 'Moderately differentiated', '2', 'Regional', '63', 'Positive', 'Positive', '14', '7', '75', 'Alive']
['58', 'White', 'Married', 'T1', 'N1', 'IIA', 'Poorly differentiated', '3', 'Regional', '18', 'Positive', 'Positive', '2', '1', '84', 'Alive']
['47', 'White', 'Married', 'T2', 'N1', 'IIB', 'Poorly differentiated', '3', 'Regional', '41', 'Positive', 'Positive', '3', '1', '50', 'Alive']
['51', 'White', 'Single ', 'T
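
Every value in the rows above is a string, including the numbers. A possible next step, sketched here with `csv.DictReader`, is to address fields by column name and convert the numeric ones. The in-memory sample uses only four of the dataset's columns and two of its rows, purely for illustration.

```python
import csv
import io

# In-memory stand-in for the file, with a subset of columns for brevity.
sample = io.StringIO(
    "Age,Race,Tumor Size,Status\n"
    "68,White,4,Alive\n"
    "50,White,35,Alive\n"
)

reader = csv.DictReader(sample)  # uses the first row as the field names

rows = []
for row in reader:
    # csv always yields strings, so numeric columns need explicit conversion
    row["Age"] = int(row["Age"])
    row["Tumor Size"] = int(row["Tumor Size"])
    rows.append(row)

print(rows[0])
```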

### Using NumPy to import the data

In [5]:
import numpy as np

file_path = 'C:/Users/user/Desktop/Breast_Cancer.csv'


In [6]:
dataset = np.genfromtxt(file_path, delimiter=',')
print(dataset)


[[ nan  nan  nan ...  nan  nan  nan]
 [ 68.  nan  nan ...   1.  60.  nan]
 [ 50.  nan  nan ...   5.  62.  nan]
 ...
 [ 68.  nan  nan ...   3.  69.  nan]
 [ 58.  nan  nan ...   1.  72.  nan]
 [ 46.  nan  nan ...   2. 100.  nan]]


In [7]:
head = dataset[:5]
print(head)

[[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 [68. nan nan nan nan nan nan  3. nan  4. nan nan 24.  1. 60. nan]
 [50. nan nan nan nan nan nan  2. nan 35. nan nan 14.  5. 62. nan]
 [58. nan nan nan nan nan nan  2. nan 63. nan nan 14.  7. 75. nan]
 [58. nan nan nan nan nan nan  3. nan 18. nan nan  2.  1. 84. nan]]
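
The `nan` values appear because `np.genfromtxt` tries to convert every field to a float by default, so the header row and the text columns cannot be converted. One way to keep the text, sketched below on a small in-memory sample rather than the real file, is to pass `dtype=None` and `names=True` so that NumPy infers a type per column and reads the header as field names.

```python
import io
import numpy as np

# Small illustrative sample; the real dataset has many more columns.
sample = "Age,Race,Grade\n68,White,3\n50,White,2\n"

# Default behaviour: everything is parsed as float, text becomes nan.
as_float = np.genfromtxt(io.StringIO(sample), delimiter=",")

# dtype=None infers a type per column; names=True reads the header row.
typed = np.genfromtxt(
    io.StringIO(sample),
    delimiter=",",
    dtype=None,
    names=True,
    encoding="utf-8",
)

print(as_float)
print(typed["Age"], typed["Race"])
```

The second result is a structured array, so columns are selected by name (`typed["Race"]`) rather than by position.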


### Using pandas

In [8]:
import pandas as pd

file_path = 'C:/Users/user/Desktop/Breast_Cancer.csv'

df = pd.read_csv(file_path)

print(df.head())


   Age   Race Marital Status T Stage  N Stage 6th Stage  \
0   68  White        Married       T1      N1       IIA   
1   50  White        Married       T2      N2      IIIA   
2   58  White       Divorced       T3      N3      IIIC   
3   58  White        Married       T1      N1       IIA   
4   47  White        Married       T2      N1       IIB   

               differentiate Grade   A Stage  Tumor Size Estrogen Status  \
0      Poorly differentiated     3  Regional           4        Positive   
1  Moderately differentiated     2  Regional          35        Positive   
2  Moderately differentiated     2  Regional          63        Positive   
3      Poorly differentiated     3  Regional          18        Positive   
4      Poorly differentiated     3  Regional          41        Positive   

  Progesterone Status  Regional Node Examined  Reginol Node Positive  \
0            Positive                      24                      1   
1            Positive                      1

In [9]:
df

|  | Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive |
| 1 | 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive |
| 2 | 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive |
| 3 | 58 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 18 | Positive | Positive | 2 | 1 | 84 | Alive |
| 4 | 47 | White | Married | T2 | N1 | IIB | Poorly differentiated | 3 | Regional | 41 | Positive | Positive | 3 | 1 | 50 | Alive |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4019 | 62 | Other | Married | T1 | N1 | IIA | Moderately differentiated | 2 | Regional | 9 | Positive | Positive | 1 | 1 | 49 | Alive |
| 4020 | 56 | White | Divorced | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 46 | Positive | Positive | 14 | 8 | 69 | Alive |
| 4021 | 68 | White | Married | T2 | N1 | IIB | Moderately differentiated | 2 | Regional | 22 | Positive | Negative | 11 | 3 | 69 | Alive |
| 4022 | 58 | Black | Divorced | T2 | N1 | IIB | Moderately differentiated | 2 | Regional | 44 | Positive | Positive | 11 | 1 | 72 | Alive |
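
The raw header row printed earlier shows the column name `'T Stage '` with a trailing space, which makes `df['T Stage']` fail with a `KeyError`. A possible clean-up, sketched on a small in-memory sample rather than the real file, is to strip whitespace from the column names after loading.

```python
import io
import pandas as pd

# Illustrative sample reproducing the trailing space in the 'T Stage ' header.
sample = io.StringIO("Age,T Stage ,Status\n68,T1,Alive\n50,T2,Alive\n")
df_sample = pd.read_csv(sample)

print(list(df_sample.columns))  # the second name still ends with a space

# Remove leading and trailing whitespace from every column name.
df_sample.columns = df_sample.columns.str.strip()

print(df_sample["T Stage"].tolist())
print(df_sample.dtypes)  # pandas infers a type per column
```

Unlike the NumPy default, `pd.read_csv` infers a separate type for each column, so numbers and text can sit side by side in one DataFrame.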


### Let's import directly from Kaggle

In [10]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


In [11]:
import os
from zipfile import ZipFile

In [12]:
!kaggle datasets download -d zynicide/wine-reviews

Traceback (most recent call last):
  File "C:\Users\user\anaconda3\Lib\site-packages\urllib3\connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\anaconda3\Lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
    raise err
  File "C:\Users\user\anaconda3\Lib\site-packages\urllib3\util\connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\user\anaconda3\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\anaconda3\Lib\site-pac

The download attempt above ended in a `TimeoutError`, so this cell did not download anything. The next cell assumes that `wine-reviews.zip` is already present at the path it uses.

In [13]:
# Extracting the downloaded zip file
with ZipFile('C:/Users/user/Desktop/New folder/Gomycode/wine-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall('wine-reviews')
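
The extraction step can be tried without the Kaggle archive. The sketch below is an illustration under stated assumptions: it builds a tiny zip file (`archive.zip` containing `reviews.csv`, both hypothetical names) in a temporary directory, lists its members, extracts it, and reads the extracted CSV.

```python
import csv
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small archive to stand in for the downloaded zip file.
    zip_path = os.path.join(tmp_dir, "archive.zip")
    with ZipFile(zip_path, "w") as zip_out:
        zip_out.writestr("reviews.csv", "country,points\nUS,96\nSpain,96\n")

    # Inspect the archive, then extract everything into a folder.
    extract_dir = os.path.join(tmp_dir, "extracted")
    with ZipFile(zip_path, "r") as zip_ref:
        member_names = zip_ref.namelist()
        zip_ref.extractall(extract_dir)

    # Read the extracted CSV like any other file.
    extracted_csv = os.path.join(extract_dir, "reviews.csv")
    with open(extracted_csv, "r", newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(member_names)
print(rows)
```

Calling `namelist()` before extracting is a convenient way to find out which file names to pass to `pd.read_csv` afterwards.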



In [14]:
# Load the data using pandas or any other library
import pandas as pd
data = pd.read_csv('C:/Users/user/Desktop/New folder/Gomycode/wine-reviews/winemag-data_first150k.csv')


In [15]:
print(data)

        Unnamed: 0 country                                        description  \
0                0      US  This tremendous 100% varietal wine hails from ...   
1                1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2                2      US  Mac Watson honors the memory of a wine once ma...   
3                3      US  This spent 20 months in 30% new French oak, an...   
4                4  France  This is the top wine from La Bégude, named aft...   
...            ...     ...                                                ...   
150925      150925   Italy  Many people feel Fiano represents southern Ita...   
150926      150926  France  Offers an intriguing nose with ginger, lime an...   
150927      150927   Italy  This classic example comes from a cru vineyard...   
150928      150928  France  A perfect salmon shade, with scents of peaches...   
150929      150929   Italy  More Pinot Grigios should taste like this. A r...   

                           

In [16]:
print(data.head())


   Unnamed: 0 country                                        description  \
0           0      US  This tremendous 100% varietal wine hails from ...   
1           1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2           2      US  Mac Watson honors the memory of a wine once ma...   
3           3      US  This spent 20 months in 30% new French oak, an...   
4           4  France  This is the top wine from La Bégude, named aft...   

                            designation  points  price        province  \
0                     Martha's Vineyard      96  235.0      California   
1  Carodorum Selección Especial Reserva      96  110.0  Northern Spain   
2         Special Selected Late Harvest      96   90.0      California   
3                               Reserve      96   65.0          Oregon   
4                            La Brûlade      95   66.0        Provence   

            region_1           region_2             variety  \
0        Napa Valley               