This repository has been archived by the owner on Mar 24, 2023. It is now read-only.

Snippets of data science reusable components in Python

codenamewei/pydata-science-playground

ARCHIVED

🔥🔥🔥
🔥🔥🔥
THIS REPOSITORY IS NOW ARCHIVED. UPDATES CONTINUE AT
https://github.com/whitepawglobal/bite-size-python
🔥🔥🔥
🔥🔥🔥

Snippets of Code for Data Science Operations in Python

project status: active

Environment Setup

Create environment (Only for the first time)

git clone https://github.com/codenamewei/pydata-science-playground.git
cd <path-to>/pydata-science-playground
conda env create -f config.yml

Activate environment

conda activate pyplayground

Package Installation

Install package with pip
pip install <package-name>. Example: pip install numpy

Install package with conda

conda install <package>. Example: conda install numpy

Bite-Size Python

Basic

Comment

  • Single Line Comment: # sample text
  • Multi Lines Comment:
     """
     Hello World!
     Nice to meet all of you cookie monsters!
     """
    

Boolean Operator

Data Types

Floating Value (float, double)

  • Format floating value to n decimal: "%.2f" % floating_var
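
As a small illustration of the formatting bullet above, the sketch below shows the printf-style form next to two equivalent spellings. The value of floating_var is a made-up sample.

```python
# Three equivalent ways to format a float to two decimal places.
floating_var = 3.14159  # hypothetical sample value

percent_style = "%.2f" % floating_var         # printf-style, as in the bullet above
format_style = "{:.2f}".format(floating_var)  # str.format
fstring_style = f"{floating_var:.2f}"         # f-string (Python 3.6+)

print(percent_style, format_style, fstring_style)
```

All three produce a string, not a float, so convert back with float(...) if further arithmetic is needed.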

Bytes

Notes:

Difference between bytes() and bytearray() is that bytes() returns an object that cannot be modified (immutable), 
and bytearray() returns an object that can be modified (mutable).
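
The difference can be seen with a short sketch. The byte values are arbitrary sample data, and the flag name bytes_rejected_write is only used for this illustration.

```python
immutable = bytes([72, 105])     # b"Hi"
mutable = bytearray([72, 105])   # bytearray(b"Hi")

mutable[0] = 104                 # allowed: bytearray supports item assignment

try:
    immutable[0] = 104           # bytes does not support item assignment
    bytes_rejected_write = False
except TypeError:
    bytes_rejected_write = True

print(immutable, mutable, bytes_rejected_write)
```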

ByteArray

Numpy

  • Numpy basic
  • Get numpy shape: nparray.shape
  • Numpy array to list: nparray.tolist()
  • Change datatype: nparray = nparray.astype(<dtype>) Example: nparray = nparray.astype("uint8")
  • Numpy NaN (Not A Number): Constant to act as a placeholder for any missing numerical values in the array: np.nan (older releases also accept the aliases np.NaN and np.NAN; np.NaN was removed in NumPy 2.0)
  • Numpy multiply by a value: nparray = nparray * 255
  • Numpy array to image
  • Numpy <> Binary File(.npy)
  • Use of numpy.where
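
A minimal sketch that chains several of the operations listed above, assuming NumPy is installed. The array contents are invented sample data, and replacing NaN with 0.0 is just one possible way to use numpy.where.

```python
import numpy as np

nparray = np.array([[0.0, 0.5], [1.0, np.nan]])  # hypothetical sample data

shape = nparray.shape                                 # tuple of dimensions
cleaned = np.where(np.isnan(nparray), 0.0, nparray)  # replace NaN with 0.0
scaled = (cleaned * 255).astype("uint8")              # multiply, then change dtype
as_list = scaled.tolist()                             # nested Python list

print(shape, as_list)
```

Note that astype("uint8") truncates the fractional part rather than rounding it.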

String

  • Generate string with parameter
  • Check if string is empty, len = 0: if not strvar:
  • Check if string contains digit: any(ch.isdigit() for ch in strvar) # returns True if there is a digit
  • Check file extension: notebooks/string/check_file_extension.ipynb
  • Capitalize a string: strvar.capitalize()
  • Uppercase a string: strvar.upper()
  • Lowercase a string: strvar.lower()
  • Get substring from a string: strvar[<begin-index>:<end-index>] / strvar[<begin-index>:] / strvar[:<end-index>]
  • Remove white spaces in the beginning and end: strvar.strip()
  • Swap existing upper and lower case: strvar.swapcase()
  • Capitalize every first letter of a word: strvar.title()
  • Splitting string:
    • Split a string based on separator: strvar.split(separator) Example: strvar.split("x")
    • Split on white space: strvar.split()
    • If split with every character, do this instead: [*"ABCDE"] Result: ["A", "B", "C", "D", "E"]
  • Check if string starts with a substring: strvar.startswith(<substring>)
  • Check if string ends with a substring: strvar.endswith(<substring>)
  • Check if string has a substring/specific character. Returns the index of the first match, or -1 if not found: strvar.find(<substring>)
  • Get substring by index: strvar[startindex:endindex]
  • Replace string/character with intended string/character: strout = strin.replace(" ", "_")
  • Replace multiple string/characters with intended string/character
  • Generate random string
  • List to string: <separators>.join(list) example: ', '.join(listbuffer)
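
A few of the string operations above, put together in one sketch. The sample text is made up for illustration.

```python
strvar = "  hello World 42  "  # hypothetical sample text

trimmed = strvar.strip()                         # remove leading/trailing spaces
has_digit = any(ch.isdigit() for ch in trimmed)  # True if any character is a digit
words = trimmed.split()                          # split on white space
joined = "_".join(words)                         # list back to string
position = trimmed.find("World")                 # index of first match
missing = trimmed.find("xyz")                    # -1 when not found
chars = [*"ABC"]                                 # split into single characters

print(trimmed, has_digit, words, joined, position, missing, chars)
```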

Unique Identifier (UUID)

Datetime

Data Structure

  • List of str to int: list(map(int, arr))
  • List with range of values: list(range(...))
  • Split str to list of str: arr.split(" ")
  • Check for empty list: if not mylist:
  • Find if a value in a list: if value in mylist: / if value not in mylist:
  • Sort an array in place: arr.sort() / Return a sorted array: sorted(arr)
  • Get index of a value: arr.index(value)
  • Add one more value to existing list: arr.append(value)
  • Extend list with values in another list: arr.extend(arr2)
  • Remove an item from the list: arr.remove(item)
  • Check for empty list: arr = []; if not arr: #empty list
  • Check all items in a list(subset) if exist in another list, returns boolean: set(b).issubset(v)
  • Build list of same values: ['100'] * 20 # 20 items of the value '100'
  • Change values of list with List Comprehension: [func(a) for a in sample_list]
  • Iteration of list with index: for index, value in enumerate(inlist):
  • Iteration over two lists: for item1, item2 in zip(list1, list2): / [(item1, item2) for item1, item2 in zip(list1, list2)]
  • Count occurrence of items in list
  • Get maximum value in a list of numbers (even strings): max(samplelist)
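
The list operations above can be combined as in the following sketch. The input string is sample data, and collections.Counter is one possible way to count occurrences; it is not necessarily the approach used in the linked notebook.

```python
from collections import Counter

arr = "3 1 2 3".split(" ")            # split str to list of str
numbers = list(map(int, arr))         # list of str to int
ordered = sorted(numbers)             # new sorted list; numbers is unchanged
first_index_of_3 = numbers.index(3)   # index of the first match
counts = Counter(numbers)             # occurrences of each item
indexed = [(i, v) for i, v in enumerate(numbers)]      # iterate with index
sums = [a + b for a, b in zip(numbers, ordered)]       # iterate over two lists
is_subset = set([1, 2]).issubset(numbers)              # all items present?

print(numbers, ordered, counts, sums, is_subset)
```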

Named Tuple

Applicable to Python Iterables (List, Set,...)

  • To check whether any item in the iterable is truthy (True/1): any(sample_list) # returns a single True/False
  • Zip multiple iterables
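
A short sketch of both helpers, using invented sample values. One detail worth knowing is that zip stops at the shortest iterable.

```python
flags = [0, "", None, 1]        # hypothetical values; only the last one is truthy
has_truthy = any(flags)         # True because at least one item is truthy
all_falsy = any([0, "", None])  # False because no item is truthy

names = ["a", "b", "c"]
scores = [10, 20]                  # deliberately shorter than names
zipped = list(zip(names, scores))  # pairs up items, stops at the shortest input

print(has_truthy, all_falsy, zipped)
```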

Pandas Info

Pandas Operations

Pandas Type

Pandas Series

  • Series to value
  • Series/Dataframe to numpy array: input.to_numpy()
  • Series iteration: for index, item in seriesf.items():
  • Series to dict: seriesf.to_dict()

Pandas Assign values

Pandas Remove/drop values

Pandas SQL-like functions

Pandas Filtering

Pandas Excel In/Out

Pandas CSV In/Out

  • Read csv with other delimiter: pd.read_csv(<path-to-file>, delimiter = '\x01')
  • Read csv with bad lines: pd.read_csv(<path-to-file>, on_bad_lines='skip')
    • Note: pd.read_csv(<path>, error_bad_lines = False) was deprecated in pandas 1.3 and removed in pandas 2.0
  • Read csv with encoding: pd.read_csv('file name', encoding = 'utf-8')
  • Save to csv: df.to_csv('file name', index = False)
    • Note: Passing index = False prevents the index from being saved as an extra column.
  • Save to csv with encoding: df.to_csv('file name', encoding = 'utf-8')

Pandas JSON In/Out

Pandas Parquet In/Out

  • Read in parquet: pd.read_parquet(...)
  • Write to parquet: df.to_parquet(...)

Pandas Pickle In/Out

Note: Pickle has security risks (loading a pickle from an untrusted source can execute arbitrary code) and is slow in serialization (even compared to csv and json). Avoid using it.

  • Read in pickle to dataframe: df = pd.read_pickle(<file_name>) # ends with .pkl
  • Save to pickle: df.to_pickle(<file_name>)

Pandas DataFrame Others

Random

  • Generate random integer between min and max (both inclusive): from random import randint; randint(0, 100) # from 0 to 100, endpoints included
  • Generate random floating value: from random import random; random()
  • Randomly choosing an item out from a list: import random; random.choice([123, 456, 378])
  • Generate list of unique random numbers: import random; random.sample(range(10, 30), 5)
    • The example draws 5 distinct random integers from 10 to 29 (range excludes its upper bound)
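
The four calls above in one sketch. The seed is an addition for reproducible runs and is not part of the original snippets.

```python
import random

random.seed(0)  # assumption: fixed seed so repeated runs give the same values

number = random.randint(0, 100)            # integer, both endpoints inclusive
fraction = random.random()                 # float in the range [0.0, 1.0)
picked = random.choice([123, 456, 378])    # one item from the list
sample = random.sample(range(10, 30), 5)   # 5 distinct integers from 10 to 29

print(number, fraction, picked, sample)
```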

Intermediate

Error Handling

  • The character used by the operating system to separate pathname components: os.sep

  • Iterate through a path to get files/folders of all the subpaths

  • Write file: f.write(str)

  • print without new line: print(..., end="")

  • Get environment variable (second param is optional and is returned when the variable is not set): import os; os.getenv(<VARIABLE_NAME> : str, <alternative-return-value> : str)

  • Flush out print

  • Check if path is a folder: os.path.isdir(<path>)

  • Get file size

    • from pathlib import Path; outsize : int = Path(inputfilepath).stat().st_size
    • import os; outsize : int = os.path.getsize(inputfilepath)
  • Create folder: os.mkdir(<path>)

  • Create folders recursively: os.makedirs(<path>)

  • Get folder path out of given path with filename: os.path.dirname(<path-to-file>)

  • Expand home directory: os.path.expanduser('~')

  • Get current working directory (the directory the script was launched from): os.getcwd()

  • Get the list of all files and directories in the specified directory (does not expand to items in child folders): os.listdir(<path>)

  • Get the folder of the current .py file (os.getcwd() returns the working directory of the process, while this returns the location of the individual file): os.path.dirname(os.path.abspath(__file__))

  • Get filename from path: os.path.basename(configfilepath)

  • Split extension from rest of path(Including .): filename, ext = os.path.splitext(path)

  • Append certain path: sys.path.append(<path>)

  • Check if path exist: os.path.exists(<path>)

  • Remove a file: os.remove(<path>)

  • Get size of a file in bytes: os.path.getsize(<path>) or from pathlib import Path; Path(<path>).stat().st_size

  • Remove an empty directory: os.rmdir(<path>)

  • Delete a directory and all its contents: import shutil; shutil.rmtree(<path>)

  • Copy a file to another path

  • Unzip file

  • Readfile

    open(<path-to-file>, mode)
    
    • r: Open a text file for reading text
    • w: Open a text file for writing text
    • a: Open a text file for appending text
    • b: Open to read/write as bytes, combined with one of the modes above, e.g. rb or wb (see notebooks/cv/image_as_byte.ipynb)

    Reading a file has 3 functions:
    • read() or read(size): read all / size as one string.
    • readline(): read a single line from a text file and return the line as a string.
    • readlines(): read all the lines of the text file into a list of strings.

    Writing and closing:
    • write(<param> : str): write in param. Need to explicitly add \n to split line.
    • close(): close the file object
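
The path and file helpers above can be tried out safely with a sketch like the one below. It works inside a temporary folder created by tempfile.mkdtemp, which is an addition for the demo; the folder names and file name are hypothetical.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()               # scratch folder for the demo
folder = os.path.join(base, "a", "b")   # hypothetical nested path
os.makedirs(folder, exist_ok=True)      # create folders recursively
filepath = os.path.join(folder, "notes.txt")

# "with" closes the file automatically, so no explicit close() is needed.
with open(filepath, "w") as f:
    f.write("first line\n")             # \n must be added explicitly
    f.write("second line\n")

with open(filepath, "r") as f:
    lines = f.readlines()               # all lines as a list of strings

with open(filepath, "r") as f:
    first = f.readline()                # only the first line

filename = os.path.basename(filepath)   # file name without the folder part
stem, ext = os.path.splitext(filename)  # extension includes the dot
size = os.path.getsize(filepath)        # size in bytes
is_dir = os.path.isdir(folder)

shutil.rmtree(base)                     # delete the folder and all its contents
still_exists = os.path.exists(base)

print(lines, first, filename, stem, ext, size, is_dir, still_exists)
```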

System

Time

  • Measure time prior and after
  • Add delay to execution of the program by pausing: import time; time.sleep(seconds)
    • Note: stops the execution of current thread only
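
One way to measure time before and after a piece of work is sketched below. time.perf_counter is chosen here as an assumption; the original notebook may use a different clock. The sleep stands in for real work.

```python
import time

start = time.perf_counter()               # timestamp before the work
time.sleep(0.05)                          # placeholder for the work being timed
elapsed = time.perf_counter() - start     # seconds spent, as a float

print(f"elapsed: {elapsed:.3f} s")
```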

Advanced

Class

Magic Method

Data Structure - Processing iterables with a functional style

Note: Functional style can be replaced with list comprehension or generator expressions
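
To make the note concrete, the sketch below computes the squares of the even numbers in three ways. The input list is sample data.

```python
values = [1, 2, 3, 4, 5, 6]  # hypothetical sample data

# Functional style: filter the even numbers, then map to their squares.
functional = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, values)))

# Same result with a list comprehension.
comprehension = [x * x for x in values if x % 2 == 0]

# Generator expression: no intermediate list is built before summing.
lazy_total = sum(x * x for x in values if x % 2 == 0)

print(functional, comprehension, lazy_total)
```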

Inheritance

Passing variables in from command line

ConfigParser

XML Parser

URL

Performance

Multiprocessing

Logging

Built-In Logging

Logging Others

Design Patterns

Type Checking, Data Validation

Others

Networking

  • Get IP from domain name: import socket; socket.gethostbyname("www.google.com")

Concurrency

Built-in Concurrency Library: Asyncio

Hashing

Web

Software Development

REST

FastAPI

Requests

Database

Cloud

AWS

Note:

  • What is a bucket in S3

    A bucket is a container for objects stored in Amazon S3, which can contain files and folders. You can store any number of objects in a bucket, and the number of buckets per account is capped by a quota (100 by default when this note was written).

Machine Learning

Pytorch

  • Check if cuda is available - import torch; torch.cuda.is_available()
  • Softmax

Torch Tensor

Torch Tensor Creation

  • Create tensor of zeros with shape like another tensor: torch.zeros_like(another_tensor)
  • Create tensor of zeros with shape (tuple): torch.zeros(shape_in_tuple)
  • Create tensor of ones with shape like another tensor: torch.ones_like(another_tensor)
  • Create tensor of ones with shape (tuple): torch.ones(shape_in_tuple)
  • Create tensor of random floating value between 0-1 with shape like another tensor:
    torch.rand_like(another_tensor, dtype = torch.float)
  • Create tensor of random floating value between 0-1 with shape (tuple):
    torch.rand(shape_in_tuple)

Torch Tensor Info Extraction

  • Given torch.tensor buffer = tensor(4), get the value by - id = buffer.item()
  • Given torch.tensor, get the argmax of each row - torch.argmax(buffer, dim=<(int)dimension_to_reduce>)
  • Tensor to cuda - inputs = inputs.to("cuda")
  • Tensor shape - tensor.shape
  • Tensor data types - tensor.dtype
  • Device tensor is stored on - tensor.device
  • Torch tensor(single value) to value: tensorarray.item()
  • Retrieve subset of torch tensor by row index: tensor[<row_number>, :] / tensor[<row_number_from>:<row_number_to>, :]
  • Retrieve subset of torch tensor by column index: tensor[:, <column_number_from>:<column_number_to>]

Torch Tensor Conversion

Torch Tensor Operation

Dataset Loader, Iterator

  • torch.utils.data.Dataset: stores the samples and their corresponding labels
  • torch.utils.data.DataLoader: wraps an iterable around the Dataset to enable easy access to the samples

Torch Tensor In/Out

Torch Dataset

  • Image Datasets

    • Fashion MNIST Torch

      Fashion-MNIST is a dataset of Zalando’s article images consisting of 60,000 training examples and 10,000 test examples. Each example comprises a 28×28 grayscale image and an associated label from one of 10 classes.

  • Text Datasets

  • Audio Datasets

Huggingface

Computer Vision

Computer Vision - Basic

  • Get image shape: img.shape
  • Create a color image: image = np.zeros((h,w,3), np.uint8)
  • Read/Write image:
  • Read image from url
  • Pause to display image or wait for an input: cv2.waitKey(0)
  • Save an image: cv2.imwrite(pathtoimg : str, img : numpy.ndarray)
  • Show an image in window: cv2.imshow(windowname : str, frame : np.array)
  • Show an image in Jupyter notebook
    from IPython.display import Image
    Image(filename=pathtoimg)  # pathtoimg : str
    
  • Flip image: frame = cv2.flip(frame, flipcode : int)
    • Positive flip code for flip on y axis (left right flip)
    • 0 for flip on x axis (up down)
    • Negative for flipping around both axes
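
The three flip codes can be pictured with plain NumPy slicing, as in the sketch below. This is an illustration of the direction of each flip that avoids needing OpenCV; the tiny array is a made-up stand-in for an image, and the slicing is an equivalent assumed here, not a call into cv2.

```python
import numpy as np

# Hypothetical 2x3 single-channel "image".
frame = np.array([[1, 2, 3],
                  [4, 5, 6]])

left_right = frame[:, ::-1]   # same effect as a positive flip code (flip on y axis)
up_down = frame[::-1, :]      # same effect as flip code 0 (flip on x axis)
both = frame[::-1, ::-1]      # same effect as a negative flip code (both axes)

print(left_right.tolist(), up_down.tolist(), both.tolist())
```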

Computer Vision - Intermediate

Computer Vision - Filter

  • Blur with averaging mask: cv2.blur(img,(5,5))
  • GaussianBlur: blur = cv2.GaussianBlur(img,(5,5),0)
    • Note: The kernel size, (5, 5) here, must be positive and odd. A larger kernel produces stronger blurring.
  • Blurring region of image

Computer Vision - Video Stream

Computer Vision - Other

Medium Posts
