Implement additional methods for feature normalization #41

Open
Tracked by #38
ypriverol opened this issue Jan 8, 2024 · 1 comment
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@ypriverol
Member

@WangHong007 it would be good to implement additional feature normalization methods besides quantile. Quantile normalization is a strong method that removes most of the variability across peptides. It would be good to enable other normalization methods (e.g. median) that have less impact on the data. Here is a good research paper about peptide normalization.
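To make the difference concrete, here is a minimal sketch of the two ideas on a toy peptides-by-samples table of log intensities. The function names, column names, and values are hypothetical, and this is not necessarily how the project implements either method. The quantile version assumes no missing values and no ties, for brevity.

```python
import numpy as np
import pandas as pd


def median_normalize(log_intensity: pd.DataFrame) -> pd.DataFrame:
    """Shift each sample (column) so that all sample medians coincide."""
    sample_medians = log_intensity.median(axis=0, skipna=True)
    # Subtract each sample's own median, then add back the average median
    # so values stay on the original scale.
    return log_intensity - sample_medians + sample_medians.mean()


def quantile_normalize(log_intensity: pd.DataFrame) -> pd.DataFrame:
    """Force every sample (column) onto one shared distribution.

    Assumes no NaN values and no ties (illustration only).
    """
    sorted_values = np.sort(log_intensity.to_numpy(), axis=0)
    reference = sorted_values.mean(axis=1)  # mean value at each rank
    ranks = log_intensity.rank(axis=0, method="first").astype(int) - 1
    normalized = log_intensity.copy()
    for sample in log_intensity.columns:
        # Replace each value by the reference value at its rank.
        normalized[sample] = reference[ranks[sample].to_numpy()]
    return normalized


# Toy data: rows are peptides, columns are samples (hypothetical values).
log_intensity = pd.DataFrame(
    {
        "sample_a": [20.0, 22.0, 25.0, 21.0],
        "sample_b": [21.5, 23.0, 27.0, 22.0],
    },
    index=["pep1", "pep2", "pep3", "pep4"],
)

median_scaled = median_normalize(log_intensity)
quantile_scaled = quantile_normalize(log_intensity)
```

Median scaling only shifts each sample, so the spread within a sample is unchanged. Quantile normalization gives every sample exactly the same set of values, which is why it removes more of the variability.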

@ypriverol added the enhancement and good first issue labels on Jan 8, 2024
@WangHong007
Contributor

We tried several normalization methods: msstats, qnorm (fast quantile), and now additionally MedScale. Here are some points we need to focus on:

  1. For now, normalization is applied at the peptidoform (feature) level; peptide-level normalization is optional.
  2. In peptidoform normalization, should we run remove_low_frequency_peptides first, then peptidoform selection and aggregation, and normalize last? When we compared several normalization methods, the global standard deviation increased after these steps.
  3. We call dropna many times in this process, but building the pivot table still produces many null values, which affects the result of the aggregate function.
  4. What are the effects of fractions, biological replicates, conditions, and runs in the sample? For example, biological replicates can map one-to-one to samples, so the pivot produces impossible combinations whenever an index value should not appear for a given biological replicate. That is how pandas pivot_table works. Here is the pivot we build before normalization (except for msstats normalization):
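    import numpy as np
    import pandas as pd

    # One row per peptidoform / fraction / run / replicate key,
    # one column per value of class_field.
    normalize_df = pd.pivot_table(
        dataset,
        index=[
            PEPTIDE_SEQUENCE,
            PEPTIDE_CANONICAL,
            PEPTIDE_CHARGE,
            FRACTION,
            RUN,
            BIOREPLICATE,
            PROTEIN_NAME,
            STUDY_ID,
            CONDITION,
        ],
        columns=class_field,
        values=field,
        aggfunc={field: np.mean},  # average entries that share the same key
        observed=True,  # for categorical groupers, keep only observed values
    )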
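The null values described in points 3 and 4 can be reproduced on a toy table. This is a minimal sketch with hypothetical column names and values, assuming each run and biological replicate belongs to exactly one sample: when those fields are part of the row index and the sample is the column key, every row can have a value in only one column.

```python
import pandas as pd

# Long-format toy data: each run/bioreplicate maps to exactly one sample.
long_df = pd.DataFrame(
    {
        "peptide": ["PEPA", "PEPA", "PEPB", "PEPB"],
        "run": ["run1", "run2", "run1", "run2"],
        "bioreplicate": [1, 2, 1, 2],
        "sample": ["S1", "S2", "S1", "S2"],
        "intensity": [10.0, 12.0, 20.0, 18.0],
    }
)

# Sample-specific fields in the index: each row is filled in one column only.
wide_sparse = pd.pivot_table(
    long_df,
    index=["peptide", "run", "bioreplicate"],
    columns="sample",
    values="intensity",
    aggfunc="mean",
)

# Only the peptide in the index: each peptide gets one value per sample.
wide_dense = pd.pivot_table(
    long_df,
    index=["peptide"],
    columns="sample",
    values="intensity",
    aggfunc="mean",
)

print(wide_sparse.isna().sum().sum(), "nulls with run/bioreplicate in the index")
print(wide_dense.isna().sum().sum(), "nulls with peptide only in the index")
```

Calling dropna on the long table beforehand does not help here, because the nulls are created by the reshape itself. Whether dropping the sample-specific fields from the index is acceptable depends on how downstream steps use them.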
